Tidymodels is a collection of packages that covers every step needed to build a workflow for selecting and evaluating statistical learning models.
Development is funded by RStudio and led by Max Kuhn, the main developer of an older, similar package: caret.
Unlike caret, tidymodels is fully tidy-friendly. It is also more complete and offers many more features.
More information is available at tidymodels.org
Tidymodels is made up of orthogonal packages.
The term is borrowed from mathematics: given two orthogonal vectors, we can move along one of them without changing our projection onto the other.
In programming and software architecture, orthogonal components are said to be decoupled: a change in one component does not affect the others. This property demands smaller, more cohesive components with well-defined responsibilities, and it allows changes with fewer side effects. A good book on applying ideas like this to programming is The Pragmatic Programmer, by David Thomas and Andrew Hunt.
The packages that make up tidymodels work this way: when you configure the workflow that implements the training, selection, and evaluation process, several orthogonal steps are prepared, each handled by its own package:
workflows assembles all the steps of the process into a single workflow structure;
recipes builds the preprocessing steps, making feature engineering easy to define and to apply to out-of-sample data;
rsample splits data into training, test, and validation sets and provides the resampling infrastructure for cross-validation;
parsnip offers a unified interface to many types of statistical learning models;
tune helps create and run the search for the best set of hyperparameters for a model;
dials helps define candidate values for those hyperparameters;
yardstick provides the functionality needed to measure model performance.
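As a minimal sketch of how these pieces fit together (illustrative only: it uses the built-in mtcars data and a regularized regression via glmnet, none of which comes from this post):

library(tidymodels)

divisao <- initial_split(mtcars, prop = 0.8)   # rsample: train/test split
treino <- training(divisao)
reamostras <- vfold_cv(treino, v = 5)          # rsample: cross-validation folds

receita <- recipe(mpg ~ ., data = treino) %>%  # recipes: preprocessing steps
  step_normalize(all_numeric_predictors())

modelo <- linear_reg(penalty = tune()) %>%     # parsnip: unified model interface
  set_engine("glmnet")

fluxo <- workflow() %>%                        # workflows: bundle recipe + model
  add_recipe(receita) %>%
  add_model(modelo)

grade <- grid_regular(penalty(), levels = 10)  # dials: candidate hyperparameters

ajuste <- tune_grid(                           # tune: run the grid search,
  fluxo,                                       # scored with yardstick metrics
  resamples = reamostras,
  grid = grade,
  metrics = metric_set(rmse)
)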
The data come from a study by Columbia Business School researchers Ray Fisman and Sheena Iyengar.
They ran several rounds of four-minute speed dates between heterosexual men and women.
Many attributes were collected, including a final verdict on whether each date liked the other person.
The data were obtained from the Kaggle website.
They are not clean yet…
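The loading step is not shown here; assuming the Kaggle CSV was saved locally under the name Kaggle distributes, it would look something like this:

library(tidyverse)
# Hypothetical local path; the file name is assumed, not taken from the post
dados_speed_date <- read_csv("Speed Dating Data.csv")
glimpse(dados_speed_date)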
## Rows: 8,378
## Columns: 195
## $ iid <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ gender <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ idg <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ condtn <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ wave <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ round <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10...
## $ position <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ positin1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ order <dbl> 4, 3, 10, 5, 7, 6, 1, 2, 8, 9, 10, 9, 6, 1, 3, 2, 7, 8, 4,...
## $ partner <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...
## $ pid <dbl> 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 11, 12, 13, 14, 15...
## $ match <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0...
## $ int_corr <dbl> 0.14, 0.54, 0.16, 0.61, 0.21, 0.25, 0.34, 0.50, 0.28, -0.3...
## $ samerace <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1...
## $ age_o <dbl> 27, 22, 22, 23, 24, 25, 30, 27, 28, 24, 27, 22, 22, 23, 24...
## $ race_o <dbl> 2, 2, 4, 2, 3, 2, 2, 2, 2, 2, 2, 2, 4, 2, 3, 2, 2, 2, 2, 2...
## $ pf_o_att <dbl> 35.00, 60.00, 19.00, 30.00, 30.00, 50.00, 35.00, 33.33, 50...
## $ pf_o_sin <dbl> 20.00, 0.00, 18.00, 5.00, 10.00, 0.00, 15.00, 11.11, 0.00,...
## $ pf_o_int <dbl> 20.00, 0.00, 19.00, 15.00, 20.00, 30.00, 25.00, 11.11, 25....
## $ pf_o_fun <dbl> 20.00, 40.00, 18.00, 40.00, 10.00, 10.00, 10.00, 11.11, 10...
## $ pf_o_amb <dbl> 0.00, 0.00, 14.00, 5.00, 10.00, 0.00, 5.00, 11.11, 0.00, 0...
## $ pf_o_sha <dbl> 5.00, 0.00, 12.00, 5.00, 20.00, 10.00, 10.00, 22.22, 15.00...
## $ dec_o <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0...
## $ attr_o <dbl> 6, 7, 10, 7, 8, 7, 3, 6, 7, 6, 8, 7, 10, 9, 10, 7, 5, 7, 8...
## $ sinc_o <dbl> 8, 8, 10, 8, 7, 7, 6, 7, 7, 6, 7, 6, 10, 9, 10, 8, 3, 7, 6...
## $ intel_o <dbl> 8, 10, 10, 9, 9, 8, 7, 5, 8, 6, 6, 10, 10, 9, 10, 7, 4, 7,...
## $ fun_o <dbl> 8, 7, 10, 8, 6, 8, 5, 6, 8, 6, 9, 6, 10, 9, 10, 5, 3, 7, 9...
## $ amb_o <dbl> 8, 7, 10, 9, 9, 7, 8, 8, 8, 6, 7, 6, 10, 9, 7, 7, 5, 7, 8,...
## $ shar_o <dbl> 6, 5, 10, 8, 7, 7, 7, 6, 9, 6, 4, 5, 10, 9, 8, 7, 3, 5, 7,...
## $ like_o <dbl> 7.0, 8.0, 10.0, 7.0, 8.0, 7.0, 2.0, 7.0, 6.5, 6.0, 7.0, 8....
## $ prob_o <dbl> 4, 4, 10, 7, 6, 6, 1, 5, 8, 6, 2, 4, 10, 7, 1, 5, 3, 6, 8,...
## $ met_o <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2...
## $ age <dbl> 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 24, 24, 24, 24, 24...
## $ field <chr> "Law", "Law", "Law", "Law", "Law", "Law", "Law", "Law", "L...
## $ field_cd <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ undergra <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ mn_sat <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ tuition <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ race <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ imprace <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ imprelig <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ from <chr> "Chicago", "Chicago", "Chicago", "Chicago", "Chicago", "Ch...
## $ zipcode <dbl> 60521, 60521, 60521, 60521, 60521, 60521, 60521, 60521, 60...
## $ income <dbl> 69487, 69487, 69487, 69487, 69487, 69487, 69487, 69487, 69...
## $ goal <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ date <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ go_out <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ career <chr> "lawyer", "lawyer", "lawyer", "lawyer", "lawyer", "lawyer"...
## $ career_c <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sports <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ tvsports <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ exercise <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7...
## $ dining <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, ...
## $ museums <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ art <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ hiking <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ gaming <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ clubbing <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ reading <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 10, 10, 10, 10, 10, 10, 10, ...
## $ tv <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ theater <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9...
## $ movies <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 8, 8, 8, 8, 8, 8, ...
## $ concerts <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 7, 7, 7, 7, 7, 7, ...
## $ music <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ shopping <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ yoga <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ exphappy <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ expnum <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ attr1_1 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 45, 45, 45, 45, 45...
## $ sinc1_1 <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 5, 5, 5, 5, 5, 5, ...
## $ intel1_1 <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 25, 25, 25, 25, 25...
## $ fun1_1 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 20, 20, 20, 20, 20...
## $ amb1_1 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 0, 0, 0, 0, 0, 0, ...
## $ shar1_1 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 5, 5, 5, 5, 5, 5, ...
## $ attr4_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc4_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel4_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun4_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb4_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar4_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr2_1 <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 65, 65, 65, 65, 65...
## $ sinc2_1 <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 0, 0, 0, 0, 0, 0, ...
## $ intel2_1 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 10, 10, 10, 10, 10...
## $ fun2_1 <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 25, 25, 25, 25, 25...
## $ amb2_1 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ shar2_1 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ attr3_1 <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7...
## $ sinc3_1 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ fun3_1 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 10, ...
## $ intel3_1 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ amb3_1 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ attr5_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc5_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel5_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun5_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb5_1 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ dec <dbl> 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1...
## $ attr <dbl> 6, 7, 5, 7, 5, 4, 7, 4, 7, 5, 5, 8, 5, 7, 6, 8, 7, 5, 7, 6...
## $ sinc <dbl> 9, 8, 8, 6, 6, 9, 6, 9, 6, 6, 7, 5, 8, 9, 8, 7, 5, 8, 6, 7...
## $ intel <dbl> 7, 7, 9, 8, 7, 7, 7, 7, 8, 6, 8, 6, 9, 7, 7, 8, 9, 7, 8, 8...
## $ fun <dbl> 7, 8, 8, 7, 7, 4, 4, 6, 9, 8, 4, 6, 6, 6, 9, 3, 6, 5, 9, 7...
## $ amb <dbl> 6, 5, 5, 6, 6, 6, 6, 5, 8, 10, 6, 9, 3, 5, 7, 6, 7, 9, 4, ...
## $ shar <dbl> 5, 6, 7, 8, 6, 4, 7, 6, 8, 8, 3, 6, 4, 7, 8, 2, 9, 5, 5, 8...
## $ like <dbl> 7, 7, 7, 7, 6, 6, 6, 6, 7, 6, 6, 7, 6, 7, 8, 6, 8, 5, 5, 8...
## $ prob <dbl> 6, 5, NA, 6, 6, 5, 5, 7, 7, 6, 4, 3, 7, 8, 6, 5, 7, 6, 6, ...
## $ met <dbl> 2, 1, 1, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, ...
## $ match_es <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ attr1_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc1_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel1_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun1_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb1_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar1_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr3_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc3_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel3_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun3_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb3_s <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ satis_2 <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ length <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ numdat_2 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, NA, ...
## $ attr7_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc7_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel7_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun7_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb7_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar7_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr1_2 <dbl> 19.44, 19.44, 19.44, 19.44, 19.44, 19.44, 19.44, 19.44, 19...
## $ sinc1_2 <dbl> 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16...
## $ intel1_2 <dbl> 13.89, 13.89, 13.89, 13.89, 13.89, 13.89, 13.89, 13.89, 13...
## $ fun1_2 <dbl> 22.22, 22.22, 22.22, 22.22, 22.22, 22.22, 22.22, 22.22, 22...
## $ amb1_2 <dbl> 11.11, 11.11, 11.11, 11.11, 11.11, 11.11, 11.11, 11.11, 11...
## $ shar1_2 <dbl> 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16.67, 16...
## $ attr4_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc4_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel4_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun4_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb4_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar4_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr2_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc2_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel2_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun2_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb2_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar2_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr3_2 <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7...
## $ sinc3_2 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ intel3_2 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
## $ fun3_2 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9...
## $ amb3_2 <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ attr5_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc5_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel5_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun5_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb5_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ you_call <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ them_cal <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ date_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ numdat_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ num_in_3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr1_3 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 30, 30, 30, 30, 30...
## $ sinc1_3 <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 5, 5, 5, 5, 5, 5, ...
## $ intel1_3 <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 40, 40, 40, 40, 40...
## $ fun1_3 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15...
## $ amb1_3 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 0, 0, 0, 0, 0, 0, ...
## $ shar1_3 <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 10, 10, 10, 10, 10...
## $ attr7_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc7_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel7_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun7_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb7_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar7_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr4_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc4_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel4_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun4_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb4_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar4_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr2_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc2_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel2_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun2_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb2_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ shar2_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ attr3_3 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7...
## $ sinc3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ intel3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9...
## $ fun3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9...
## $ amb3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ attr5_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ sinc5_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ intel5_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fun5_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ amb5_3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
Some columns should be renamed to something more intelligible:
dados_speed_date_renomeado <- dados_speed_date %>%
rename(
unique_id_number = iid,
id_within_wave = id,
male = gender,
subject_within_gender = idg,
choice = condtn,
n_people_met_in_wave = round,
position_meeting = position,
position_started = positin1,
order_meeting = order,
partner_id_within_wave = partner,
partner_unique_id_number = pid,
interests_correlation = int_corr,
same_race = samerace,
my_age = age,
partner_age = age_o,
partner_race = race_o,
partner_stated_pref_time0_attractive = pf_o_att,
partner_stated_pref_time0_sincere = pf_o_sin,
partner_stated_pref_time0_intelligent = pf_o_int,
partner_stated_pref_time0_fun = pf_o_fun,
partner_stated_pref_time0_ambitious = pf_o_amb,
partner_stated_pref_time0_shared_interests = pf_o_sha,
cod_field = field_cd,
importance_same_race = imprace,
importance_same_religion = imprelig,
place_from = from,
zipcode = zipcode,
income_zipcode = income,
frequency_date = date,
frequency_go_out = go_out,
career_macro = career_c,
happy_expec = exphappy,
n_expect_like_you = expnum,
i_liked_partner = dec,
partner_liked_me = dec_o,
i_found_partner__attractive = attr,
i_found_partner__sincere = sinc,
i_found_partner__intelligent = intel,
i_found_partner__fun = fun,
i_found_partner__ambitious = amb,
i_found_partner__interests = shar,
degree_i_liked_partner = like,
partner_found_me__attractive = attr_o,
partner_found_me__sincere = sinc_o,
partner_found_me__intelligent = intel_o,
partner_found_me__fun = fun_o,
partner_found_me__ambitious = amb_o,
partner_found_me__interests = shar_o,
probability_i_find_partner_liked_me = prob,
met_before = met,
n_matches_you_think = match_es,
satisfaction_with_partners = satis_2,
opinion_duration_of_date = length,
opinion_num_dates = numdat_2,
num_matches_you_called = you_call,
num_matches_called_you = them_cal,
have_you_dated = date_3
)
glimpse(dados_speed_date_renomeado)
## Rows: 8,378
## Columns: 195
## $ unique_id_number <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ id_within_wave <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ male <dbl> 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ subject_within_gender <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ choice <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ wave <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ n_people_met_in_wave <dbl> 10, 10, 10, 10, 10, 10, ...
## $ position_meeting <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ position_started <lgl> NA, NA, NA, NA, NA, NA, ...
## $ order_meeting <dbl> 4, 3, 10, 5, 7, 6, 1, 2,...
## $ partner_id_within_wave <dbl> 1, 2, 3, 4, 5, 6, 7, 8, ...
## $ partner_unique_id_number <dbl> 11, 12, 13, 14, 15, 16, ...
## $ match <dbl> 0, 0, 1, 1, 1, 0, 0, 0, ...
## $ interests_correlation <dbl> 0.14, 0.54, 0.16, 0.61, ...
## $ same_race <dbl> 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ partner_age <dbl> 27, 22, 22, 23, 24, 25, ...
## $ partner_race <dbl> 2, 2, 4, 2, 3, 2, 2, 2, ...
## $ partner_stated_pref_time0_attractive <dbl> 35.00, 60.00, 19.00, 30....
## $ partner_stated_pref_time0_sincere <dbl> 20.00, 0.00, 18.00, 5.00...
## $ partner_stated_pref_time0_intelligent <dbl> 20.00, 0.00, 19.00, 15.0...
## $ partner_stated_pref_time0_fun <dbl> 20.00, 40.00, 18.00, 40....
## $ partner_stated_pref_time0_ambitious <dbl> 0.00, 0.00, 14.00, 5.00,...
## $ partner_stated_pref_time0_shared_interests <dbl> 5.00, 0.00, 12.00, 5.00,...
## $ partner_liked_me <dbl> 0, 0, 1, 1, 1, 1, 0, 0, ...
## $ partner_found_me__attractive <dbl> 6, 7, 10, 7, 8, 7, 3, 6,...
## $ partner_found_me__sincere <dbl> 8, 8, 10, 8, 7, 7, 6, 7,...
## $ partner_found_me__intelligent <dbl> 8, 10, 10, 9, 9, 8, 7, 5...
## $ partner_found_me__fun <dbl> 8, 7, 10, 8, 6, 8, 5, 6,...
## $ partner_found_me__ambitious <dbl> 8, 7, 10, 9, 9, 7, 8, 8,...
## $ partner_found_me__interests <dbl> 6, 5, 10, 8, 7, 7, 7, 6,...
## $ like_o <dbl> 7.0, 8.0, 10.0, 7.0, 8.0...
## $ prob_o <dbl> 4, 4, 10, 7, 6, 6, 1, 5,...
## $ met_o <dbl> 2, 2, 1, 2, 2, 2, 2, 2, ...
## $ my_age <dbl> 21, 21, 21, 21, 21, 21, ...
## $ field <chr> "Law", "Law", "Law", "La...
## $ cod_field <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ undergra <lgl> NA, NA, NA, NA, NA, NA, ...
## $ mn_sat <lgl> NA, NA, NA, NA, NA, NA, ...
## $ tuition <lgl> NA, NA, NA, NA, NA, NA, ...
## $ race <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ importance_same_race <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ importance_same_religion <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ place_from <chr> "Chicago", "Chicago", "C...
## $ zipcode <dbl> 60521, 60521, 60521, 605...
## $ income_zipcode <dbl> 69487, 69487, 69487, 694...
## $ goal <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ frequency_date <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ frequency_go_out <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ career <chr> "lawyer", "lawyer", "law...
## $ career_macro <dbl> NA, NA, NA, NA, NA, NA, ...
## $ sports <dbl> 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ tvsports <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ exercise <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ dining <dbl> 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ museums <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ art <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ hiking <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ gaming <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ clubbing <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ reading <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ tv <dbl> 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ theater <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ movies <dbl> 10, 10, 10, 10, 10, 10, ...
## $ concerts <dbl> 10, 10, 10, 10, 10, 10, ...
## $ music <dbl> 9, 9, 9, 9, 9, 9, 9, 9, ...
## $ shopping <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ yoga <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ happy_expec <dbl> 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ n_expect_like_you <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ attr1_1 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ sinc1_1 <dbl> 20, 20, 20, 20, 20, 20, ...
## $ intel1_1 <dbl> 20, 20, 20, 20, 20, 20, ...
## $ fun1_1 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ amb1_1 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ shar1_1 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ attr4_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc4_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel4_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun4_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb4_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar4_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr2_1 <dbl> 35, 35, 35, 35, 35, 35, ...
## $ sinc2_1 <dbl> 20, 20, 20, 20, 20, 20, ...
## $ intel2_1 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ fun2_1 <dbl> 20, 20, 20, 20, 20, 20, ...
## $ amb2_1 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ shar2_1 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ attr3_1 <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ sinc3_1 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ fun3_1 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ intel3_1 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ amb3_1 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ attr5_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc5_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel5_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun5_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb5_1 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ i_liked_partner <dbl> 1, 1, 1, 1, 1, 0, 1, 0, ...
## $ i_found_partner__attractive <dbl> 6, 7, 5, 7, 5, 4, 7, 4, ...
## $ i_found_partner__sincere <dbl> 9, 8, 8, 6, 6, 9, 6, 9, ...
## $ i_found_partner__intelligent <dbl> 7, 7, 9, 8, 7, 7, 7, 7, ...
## $ i_found_partner__fun <dbl> 7, 8, 8, 7, 7, 4, 4, 6, ...
## $ i_found_partner__ambitious <dbl> 6, 5, 5, 6, 6, 6, 6, 5, ...
## $ i_found_partner__interests <dbl> 5, 6, 7, 8, 6, 4, 7, 6, ...
## $ degree_i_liked_partner <dbl> 7, 7, 7, 7, 6, 6, 6, 6, ...
## $ probability_i_find_partner_liked_me <dbl> 6, 5, NA, 6, 6, 5, 5, 7,...
## $ met_before <dbl> 2, 1, 1, 2, 2, 2, 2, NA,...
## $ n_matches_you_think <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ attr1_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc1_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel1_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun1_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb1_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar1_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr3_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc3_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel3_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun3_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb3_s <lgl> NA, NA, NA, NA, NA, NA, ...
## $ satisfaction_with_partners <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ opinion_duration_of_date <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ opinion_num_dates <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ attr7_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc7_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel7_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun7_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb7_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar7_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr1_2 <dbl> 19.44, 19.44, 19.44, 19....
## $ sinc1_2 <dbl> 16.67, 16.67, 16.67, 16....
## $ intel1_2 <dbl> 13.89, 13.89, 13.89, 13....
## $ fun1_2 <dbl> 22.22, 22.22, 22.22, 22....
## $ amb1_2 <dbl> 11.11, 11.11, 11.11, 11....
## $ shar1_2 <dbl> 16.67, 16.67, 16.67, 16....
## $ attr4_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc4_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel4_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun4_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb4_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar4_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr2_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc2_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel2_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun2_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb2_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar2_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr3_2 <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ sinc3_2 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ intel3_2 <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ fun3_2 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ amb3_2 <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ attr5_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc5_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel5_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun5_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb5_2 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ num_matches_you_called <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ num_matches_called_you <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ have_you_dated <dbl> 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ numdat_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ num_in_3 <dbl> NA, NA, NA, NA, NA, NA, ...
## $ attr1_3 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ sinc1_3 <dbl> 20, 20, 20, 20, 20, 20, ...
## $ intel1_3 <dbl> 20, 20, 20, 20, 20, 20, ...
## $ fun1_3 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ amb1_3 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ shar1_3 <dbl> 15, 15, 15, 15, 15, 15, ...
## $ attr7_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc7_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel7_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun7_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb7_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar7_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr4_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc4_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel4_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun4_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb4_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar4_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr2_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc2_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel2_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun2_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb2_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ shar2_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ attr3_3 <dbl> 5, 5, 5, 5, 5, 5, 5, 5, ...
## $ sinc3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ intel3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ fun3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ amb3_3 <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ attr5_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ sinc5_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ intel5_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ fun5_3 <lgl> NA, NA, NA, NA, NA, NA, ...
## $ amb5_3 <lgl> NA, NA, NA, NA, NA, NA, ...
There are still columns with cryptic suffixes, such as 1_1. A helper function translates each abbreviation into a readable name:
adjust_column_feature <- function(x, suffix, meaning) {
  # Strip the numeric suffix (e.g. "4_1") from the column name...
  suffix_removed <- str_remove(string = x, pattern = suffix)
  # ...and expand the remaining abbreviation into a readable attribute name
  type <- case_when(
    suffix_removed == "attr" ~ "attractive",
    suffix_removed == "sinc" ~ "sincere",
    suffix_removed == "intel" ~ "intelligent",
    suffix_removed == "fun" ~ "fun",
    suffix_removed == "amb" ~ "ambitious",
    suffix_removed == "shar" ~ "shared_interests"
  )
  str_glue("{meaning}_{type}")
}
dados_speed_date_rename_with <- dados_speed_date_renomeado %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)4_1"),
.fn = ~adjust_column_feature(x = .x, suffix = "4_1", meaning = "competitors_look_for_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)4_2"),
.fn = ~adjust_column_feature(x = .x, suffix = "4_2", meaning = "competitors_look_for_follow_up_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)4_3"),
.fn = ~adjust_column_feature(x = .x, suffix = "4_3", meaning = "competitors_look_for_follow_up_weeks_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)1_1"),
.fn = ~adjust_column_feature(x = .x, suffix = "1_1", meaning = "you_look_for_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)1_s"),
.fn = ~adjust_column_feature(x = .x, suffix = "1_s", meaning = "you_look_for_half_way_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)1_2"),
.fn = ~adjust_column_feature(x = .x, suffix = "1_2", meaning = "you_look_for_follow_up_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)1_3"),
.fn = ~adjust_column_feature(x = .x, suffix = "1_3", meaning = "you_look_for_follow_up_weeks_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)2_1"),
.fn = ~adjust_column_feature(x = .x, suffix = "2_1", meaning = "you_think_opposite_sex_look_for_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)2_2"),
.fn = ~adjust_column_feature(x = .x, suffix = "2_2", meaning = "you_think_opposite_sex_look_for_follow_up_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)2_3"),
.fn = ~adjust_column_feature(x = .x, suffix = "2_3", meaning = "you_think_opposite_sex_look_for_follow_up_weeks_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)5_1"),
.fn = ~adjust_column_feature(x = .x, suffix = "5_1", meaning = "others_perceive_you_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)5_2"),
.fn = ~adjust_column_feature(x = .x, suffix = "5_2", meaning = "others_perceive_you_follow_up_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)5_3"),
.fn = ~adjust_column_feature(x = .x, suffix = "5_3", meaning = "others_perceive_you_follow_up_weeks_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)3_1"),
.fn = ~adjust_column_feature(x = .x, suffix = "3_1", meaning = "you_perceive_yourself_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)3_2"),
.fn = ~adjust_column_feature(x = .x, suffix = "3_2", meaning = "you_perceive_yourself_follow_up_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)3_s"),
.fn = ~adjust_column_feature(x = .x, suffix = "3_s", meaning = "you_perceive_yourself_half_way_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)3_3"),
.fn = ~adjust_column_feature(x = .x, suffix = "3_3", meaning = "you_perceive_yourself_follow_up_weeks_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)7_2"),
.fn = ~adjust_column_feature(x = .x, suffix = "7_2", meaning = "actual_importance_")
) %>%
rename_with(
.cols = matches("^(?:attr|sinc|intel|fun|amb|shar)7_3"),
.fn = ~adjust_column_feature(x = .x, suffix = "7_3", meaning = "actual_importance_follow_up_weeks_")
) %>%
select(
-c(
undergra,
mn_sat,
tuition
)
)
glimpse(dados_speed_date_rename_with)
## Rows: 8,378
## Columns: 192
## $ unique_id_number <dbl> 1...
## $ id_within_wave <dbl> 1...
## $ male <dbl> 0...
## $ subject_within_gender <dbl> 1...
## $ choice <dbl> 1...
## $ wave <dbl> 1...
## $ n_people_met_in_wave <dbl> 1...
## $ position_meeting <dbl> 7...
## $ position_started <lgl> N...
## $ order_meeting <dbl> 4...
## $ partner_id_within_wave <dbl> 1...
## $ partner_unique_id_number <dbl> 1...
## $ match <dbl> 0...
## $ interests_correlation <dbl> 0...
## $ same_race <dbl> 0...
## $ partner_age <dbl> 2...
## $ partner_race <dbl> 2...
## $ partner_stated_pref_time0_attractive <dbl> 3...
## $ partner_stated_pref_time0_sincere <dbl> 2...
## $ partner_stated_pref_time0_intelligent <dbl> 2...
## $ partner_stated_pref_time0_fun <dbl> 2...
## $ partner_stated_pref_time0_ambitious <dbl> 0...
## $ partner_stated_pref_time0_shared_interests <dbl> 5...
## $ partner_liked_me <dbl> 0...
## $ partner_found_me__attractive <dbl> 6...
## $ partner_found_me__sincere <dbl> 8...
## $ partner_found_me__intelligent <dbl> 8...
## $ partner_found_me__fun <dbl> 8...
## $ partner_found_me__ambitious <dbl> 8...
## $ partner_found_me__interests <dbl> 6...
## $ like_o <dbl> 7...
## $ prob_o <dbl> 4...
## $ met_o <dbl> 2...
## $ my_age <dbl> 2...
## $ field <chr> "...
## $ cod_field <dbl> 1...
## $ race <dbl> 4...
## $ importance_same_race <dbl> 2...
## $ importance_same_religion <dbl> 4...
## $ place_from <chr> "...
## $ zipcode <dbl> 6...
## $ income_zipcode <dbl> 6...
## $ goal <dbl> 2...
## $ frequency_date <dbl> 7...
## $ frequency_go_out <dbl> 1...
## $ career <chr> "...
## $ career_macro <dbl> N...
## $ sports <dbl> 9...
## $ tvsports <dbl> 2...
## $ exercise <dbl> 8...
## $ dining <dbl> 9...
## $ museums <dbl> 1...
## $ art <dbl> 1...
## $ hiking <dbl> 5...
## $ gaming <dbl> 1...
## $ clubbing <dbl> 5...
## $ reading <dbl> 6...
## $ tv <dbl> 9...
## $ theater <dbl> 1...
## $ movies <dbl> 1...
## $ concerts <dbl> 1...
## $ music <dbl> 9...
## $ shopping <dbl> 8...
## $ yoga <dbl> 1...
## $ happy_expec <dbl> 3...
## $ n_expect_like_you <dbl> 2...
## $ you_look_for__attractive <dbl> 1...
## $ you_look_for__sincere <dbl> 2...
## $ you_look_for__intelligent <dbl> 2...
## $ you_look_for__fun <dbl> 1...
## $ you_look_for__ambitious <dbl> 1...
## $ you_look_for__shared_interests <dbl> 1...
## $ competitors_look_for__attractive <lgl> N...
## $ competitors_look_for__sincere <lgl> N...
## $ competitors_look_for__intelligent <lgl> N...
## $ competitors_look_for__fun <lgl> N...
## $ competitors_look_for__ambitious <lgl> N...
## $ competitors_look_for__shared_interests <lgl> N...
## $ you_think_opposite_sex_look_for__attractive <dbl> 3...
## $ you_think_opposite_sex_look_for__sincere <dbl> 2...
## $ you_think_opposite_sex_look_for__intelligent <dbl> 1...
## $ you_think_opposite_sex_look_for__fun <dbl> 2...
## $ you_think_opposite_sex_look_for__ambitious <dbl> 5...
## $ you_think_opposite_sex_look_for__shared_interests <dbl> 5...
## $ you_perceive_yourself__attractive <dbl> 6...
## $ you_perceive_yourself__sincere <dbl> 8...
## $ you_perceive_yourself__fun <dbl> 8...
## $ you_perceive_yourself__intelligent <dbl> 8...
## $ you_perceive_yourself__ambitious <dbl> 7...
## $ others_perceive_you__attractive <lgl> N...
## $ others_perceive_you__sincere <lgl> N...
## $ others_perceive_you__intelligent <lgl> N...
## $ others_perceive_you__fun <lgl> N...
## $ others_perceive_you__ambitious <lgl> N...
## $ i_liked_partner <dbl> 1...
## $ i_found_partner__attractive <dbl> 6...
## $ i_found_partner__sincere <dbl> 9...
## $ i_found_partner__intelligent <dbl> 7...
## $ i_found_partner__fun <dbl> 7...
## $ i_found_partner__ambitious <dbl> 6...
## $ i_found_partner__interests <dbl> 5...
## $ degree_i_liked_partner <dbl> 7...
## $ probability_i_find_partner_liked_me <dbl> 6...
## $ met_before <dbl> 2...
## $ n_matches_you_think <dbl> 4...
## $ you_look_for_half_way__attractive <lgl> N...
## $ you_look_for_half_way__sincere <lgl> N...
## $ you_look_for_half_way__intelligent <lgl> N...
## $ you_look_for_half_way__fun <lgl> N...
## $ you_look_for_half_way__ambitious <lgl> N...
## $ you_look_for_half_way__shared_interests <lgl> N...
## $ you_perceive_yourself_half_way__attractive <lgl> N...
## $ you_perceive_yourself_half_way__sincere <lgl> N...
## $ you_perceive_yourself_half_way__intelligent <lgl> N...
## $ you_perceive_yourself_half_way__fun <lgl> N...
## $ you_perceive_yourself_half_way__ambitious <lgl> N...
## $ satisfaction_with_partners <dbl> 6...
## $ opinion_duration_of_date <dbl> 2...
## $ opinion_num_dates <dbl> 1...
## $ actual_importance__attractive <lgl> N...
## $ actual_importance__sincere <lgl> N...
## $ actual_importance__intelligent <lgl> N...
## $ actual_importance__fun <lgl> N...
## $ actual_importance__ambitious <lgl> N...
## $ actual_importance__shared_interests <lgl> N...
## $ you_look_for_follow_up__attractive <dbl> 1...
## $ you_look_for_follow_up__sincere <dbl> 1...
## $ you_look_for_follow_up__intelligent <dbl> 1...
## $ you_look_for_follow_up__fun <dbl> 2...
## $ you_look_for_follow_up__ambitious <dbl> 1...
## $ you_look_for_follow_up__shared_interests <dbl> 1...
## $ competitors_look_for_follow_up__attractive <lgl> N...
## $ competitors_look_for_follow_up__sincere <lgl> N...
## $ competitors_look_for_follow_up__intelligent <lgl> N...
## $ competitors_look_for_follow_up__fun <lgl> N...
## $ competitors_look_for_follow_up__ambitious <lgl> N...
## $ competitors_look_for_follow_up__shared_interests <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__attractive <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__sincere <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__intelligent <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__fun <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__ambitious <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up__shared_interests <lgl> N...
## $ you_perceive_yourself_follow_up__attractive <dbl> 6...
## $ you_perceive_yourself_follow_up__sincere <dbl> 7...
## $ you_perceive_yourself_follow_up__intelligent <dbl> 8...
## $ you_perceive_yourself_follow_up__fun <dbl> 7...
## $ you_perceive_yourself_follow_up__ambitious <dbl> 6...
## $ others_perceive_you_follow_up__attractive <lgl> N...
## $ others_perceive_you_follow_up__sincere <lgl> N...
## $ others_perceive_you_follow_up__intelligent <lgl> N...
## $ others_perceive_you_follow_up__fun <lgl> N...
## $ others_perceive_you_follow_up__ambitious <lgl> N...
## $ num_matches_you_called <dbl> 1...
## $ num_matches_called_you <dbl> 1...
## $ have_you_dated <dbl> 0...
## $ numdat_3 <lgl> N...
## $ num_in_3 <dbl> N...
## $ you_look_for_follow_up_weeks__attractive <dbl> 1...
## $ you_look_for_follow_up_weeks__sincere <dbl> 2...
## $ you_look_for_follow_up_weeks__intelligent <dbl> 2...
## $ you_look_for_follow_up_weeks__fun <dbl> 1...
## $ you_look_for_follow_up_weeks__ambitious <dbl> 1...
## $ you_look_for_follow_up_weeks__shared_interests <dbl> 1...
## $ actual_importance_follow_up_weeks__attractive <lgl> N...
## $ actual_importance_follow_up_weeks__sincere <lgl> N...
## $ actual_importance_follow_up_weeks__intelligent <lgl> N...
## $ actual_importance_follow_up_weeks__fun <lgl> N...
## $ actual_importance_follow_up_weeks__ambitious <lgl> N...
## $ actual_importance_follow_up_weeks__shared_interests <lgl> N...
## $ competitors_look_for_follow_up_weeks__attractive <lgl> N...
## $ competitors_look_for_follow_up_weeks__sincere <lgl> N...
## $ competitors_look_for_follow_up_weeks__intelligent <lgl> N...
## $ competitors_look_for_follow_up_weeks__fun <lgl> N...
## $ competitors_look_for_follow_up_weeks__ambitious <lgl> N...
## $ competitors_look_for_follow_up_weeks__shared_interests <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__attractive <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__sincere <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__intelligent <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__fun <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__ambitious <lgl> N...
## $ you_think_opposite_sex_look_for_follow_up_weeks__shared_interests <lgl> N...
## $ you_perceive_yourself_follow_up_weeks__attractive <dbl> 5...
## $ you_perceive_yourself_follow_up_weeks__sincere <dbl> 7...
## $ you_perceive_yourself_follow_up_weeks__intelligent <dbl> 7...
## $ you_perceive_yourself_follow_up_weeks__fun <dbl> 7...
## $ you_perceive_yourself_follow_up_weeks__ambitious <dbl> 7...
## $ others_perceive_you_follow_up_weeks__attractive <lgl> N...
## $ others_perceive_you_follow_up_weeks__sincere <lgl> N...
## $ others_perceive_you_follow_up_weeks__intelligent <lgl> N...
## $ others_perceive_you_follow_up_weeks__fun <lgl> N...
## $ others_perceive_you_follow_up_weeks__ambitious <lgl> N...
Many attributes are numerically coded, which hurts interpretation, so they were converted to character vectors.
This kind of numeric coding is very common in off-the-shelf, point-and-click statistical analysis products.
Note that the frequency_date attribute was turned into a factor whose levels have a specific order. This will have implications later on.
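One concrete implication (an illustration, not taken from the original analysis): an ordered factor supports direct comparisons between levels, and modeling functions encode it with polynomial contrasts instead of plain dummy variables.

freq <- factor(
  c("Never", "Once a week", "Twice a week"),
  levels = c("Several a week", "Twice a week", "Once a week",
             "Twice a month", "Once a month", "Several a year", "Never"),
  ordered = TRUE
)
freq < "Once a month"   # comparisons respect the declared order
## [1] FALSE  TRUE  TRUE
# In model matrices, ordered factors receive polynomial contrasts
# (freq.L, freq.Q, ...) rather than one dummy column per level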
dados_speed_date_fatores <- dados_speed_date_rename_with %>% mutate(
choice = if_else(choice == 1, "limited", "extensive"),
field_factor = case_when(
cod_field == 1 ~ "Law",
cod_field == 2 ~ "Math",
cod_field == 3 ~ "Social Science, Psychologist" ,
cod_field == 4 ~ "Medical Science, Pharmaceuticals, and Bio Tech" ,
cod_field == 5 ~ "Engineering" ,
cod_field == 6 ~ "English/Creative Writing/ Journalism" ,
cod_field == 7 ~ "History/Religion/Philosophy" ,
cod_field == 8 ~ "Business/Econ/Finance" ,
cod_field == 9 ~ "Education, Academia" ,
cod_field == 10 ~ "Biological Sciences/Chemistry/Physics",
cod_field == 11 ~ "Social Work" ,
cod_field == 12 ~ "Undergrad/undecided" ,
cod_field == 13 ~ "Political Science/International Affairs" ,
cod_field == 14 ~ "Film",
cod_field == 15 ~ "Fine Arts/Arts Administration",
cod_field == 16 ~ "Languages",
cod_field == 17 ~ "Architecture",
cod_field == 18 ~ "Other"
),
race = case_when(
race == 1 ~ "Black",
race == 2 ~ "White",
race == 3 ~ "Latino",
race == 4 ~ "Asian" ,
race == 5 ~ "Native American" ,
race == 6 ~ "Others"
),
partner_race = case_when(
partner_race == 1 ~ "Black",
partner_race == 2 ~ "White",
partner_race == 3 ~ "Latino",
partner_race == 4 ~ "Asian" ,
partner_race == 5 ~ "Native American" ,
partner_race == 6 ~ "Others"
),
goal = case_when(
goal == 1 ~ "Fun",
goal == 2 ~ "Meet new people",
goal == 3 ~ "Date",
goal == 4 ~ "Serious",
goal == 5 ~ "To say",
goal == 6 ~ "Other"
),
cod_frequency_date = frequency_date,
frequency_date =
case_when(
frequency_date == 1 ~ "Several a week",
frequency_date == 2 ~ "Twice a week",
frequency_date == 3 ~ "Once a week",
frequency_date == 4 ~ "Twice a month",
frequency_date == 5 ~ "Once a month",
frequency_date == 6 ~ "Several a year",
frequency_date == 7 ~ "Never"
) %>%
factor(
levels = c(
"Several a week",
"Twice a week",
"Once a week",
"Twice a month",
"Once a month",
"Several a year",
"Never"
),
ordered = TRUE
),
frequency_go_out =
case_when(
frequency_go_out == 1 ~ "Several a week",
frequency_go_out == 2 ~ "Twice a week",
frequency_go_out == 3 ~ "Once a week",
frequency_go_out == 4 ~ "Twice a month",
frequency_go_out == 5 ~ "Once a month",
frequency_go_out == 6 ~ "Several a year",
frequency_go_out == 7 ~ "Never"
) %>%
factor(
levels = c(
"Several a week",
"Twice a week",
"Once a week",
"Twice a month",
"Once a month",
"Several a year",
"Never"
),
ordered = TRUE
),
career = str_to_title(career),
career_macro =
case_when(
career_macro == 1 ~ "Lawyer",
career_macro == 2 ~ "Academic/Research",
career_macro == 3 ~ "Psychologist" ,
career_macro == 4 ~ "Doctor/Medicine" ,
career_macro == 5 ~ "Engineer" ,
career_macro == 6 ~ "Creative Arts/Entertainment" ,
career_macro == 7 ~ "Banking/Consulting/Finance/Marketing/Business/CEO/Entrepreneur/Admin" ,
career_macro == 8 ~ "Real Estate" ,
career_macro == 9 ~ "International/Humanitarian Affairs" ,
career_macro == 10 ~ "Undecided" ,
career_macro == 11 ~ "Social Work",
career_macro == 12 ~ "Speech Pathology",
career_macro == 13 ~ "Politics",
career_macro == 14 ~ "Pro sports/Athletics",
career_macro == 15 ~ "Other",
career_macro == 16 ~ "Journalism",
career_macro == 17 ~ "Architecture"
),
met_before = if_else(met_before == 1, TRUE, FALSE),
opinion_duration_of_date = case_when(
opinion_duration_of_date == 1 ~ "Too little",
opinion_duration_of_date == 2 ~ "Too much",
opinion_duration_of_date == 3 ~ "Just Right"
),
opinion_num_dates = case_when(
opinion_num_dates == 1 ~ "Too few",
opinion_num_dates == 2 ~ "Too many"
),
have_you_dated = case_when(
have_you_dated == 1 ~ TRUE,
have_you_dated == 2 ~ FALSE
),
sex = if_else(male > 0, "Homem", "Mulher") %>% as_factor()
) %>%
select(
match,
unique_id_number,
id_within_wave,
sex,
subject_within_gender,
choice,
n_people_met_in_wave,
position_meeting,
position_started,
order_meeting,
partner_id_within_wave,
partner_unique_id_number,
interests_correlation,
same_race,
my_age,
partner_age,
partner_race,
partner_stated_pref_time0_attractive,
partner_stated_pref_time0_sincere,
partner_stated_pref_time0_intelligent,
partner_stated_pref_time0_fun,
partner_stated_pref_time0_ambitious,
partner_stated_pref_time0_shared_interests,
importance_same_race,
importance_same_religion,
income_zipcode,
frequency_date,
frequency_go_out,
career_macro,
happy_expec,
n_expect_like_you,
partner_liked_me,
i_liked_partner,
i_found_partner__attractive,
i_found_partner__sincere,
i_found_partner__intelligent,
i_found_partner__fun,
i_found_partner__ambitious,
i_found_partner__interests,
partner_found_me__attractive,
partner_found_me__sincere,
partner_found_me__intelligent,
partner_found_me__fun,
partner_found_me__ambitious,
partner_found_me__interests,
probability_i_find_partner_liked_me,
met_before,
opinion_duration_of_date,
opinion_num_dates,
starts_with("competitors_look_for__"),
starts_with("you_look_for__"),
starts_with("opposite_sex_look_for__"),
starts_with("others_perceive_you__"),
starts_with("you_perceive_yourself__"),
starts_with("actual_importance__"),
race,
goal
) %>%
mutate(
across(
.cols = where(is.character),
.fns = as.factor
)
) %>%
mutate(
across(
.cols = c(match, same_race, partner_liked_me, i_liked_partner) ,
.fns = as.logical
)
)
glimpse(dados_speed_date_fatores)
## Rows: 8,378
## Columns: 79
## $ match <lgl> FALSE, FALSE, TRUE, TRUE...
## $ unique_id_number <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ id_within_wave <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ sex <fct> Mulher, Mulher, Mulher, ...
## $ subject_within_gender <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ choice <fct> limited, limited, limite...
## $ n_people_met_in_wave <dbl> 10, 10, 10, 10, 10, 10, ...
## $ position_meeting <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ position_started <lgl> NA, NA, NA, NA, NA, NA, ...
## $ order_meeting <dbl> 4, 3, 10, 5, 7, 6, 1, 2,...
## $ partner_id_within_wave <dbl> 1, 2, 3, 4, 5, 6, 7, 8,...
## $ partner_unique_id_number <dbl> 11, 12, 13, 14, 15, 16, ...
## $ interests_correlation <dbl> 0.14, 0.54, 0.16, 0.61, ...
## $ same_race <lgl> FALSE, FALSE, TRUE, FALS...
## $ my_age <dbl> 21, 21, 21, 21, 21, 21, ...
## $ partner_age <dbl> 27, 22, 22, 23, 24, 25, ...
## $ partner_race <fct> White, White, Asian, Whi...
## $ partner_stated_pref_time0_attractive <dbl> 35.00, 60.00, 19.00, 30....
## $ partner_stated_pref_time0_sincere <dbl> 20.00, 0.00, 18.00, 5.00...
## $ partner_stated_pref_time0_intelligent <dbl> 20.00, 0.00, 19.00, 15.0...
## $ partner_stated_pref_time0_fun <dbl> 20.00, 40.00, 18.00, 40....
## $ partner_stated_pref_time0_ambitious <dbl> 0.00, 0.00, 14.00, 5.00,...
## $ partner_stated_pref_time0_shared_interests <dbl> 5.00, 0.00, 12.00, 5.00,...
## $ importance_same_race <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ importance_same_religion <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ income_zipcode <dbl> 69487, 69487, 69487, 694...
## $ frequency_date <ord> Never, Never, Never, Nev...
## $ frequency_go_out <ord> Several a week, Several ...
## $ career_macro <fct> NA, NA, NA, NA, NA, NA, ...
## $ happy_expec <dbl> 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ n_expect_like_you <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ partner_liked_me <lgl> FALSE, FALSE, TRUE, TRUE...
## $ i_liked_partner <lgl> TRUE, TRUE, TRUE, TRUE, ...
## $ i_found_partner__attractive <dbl> 6, 7, 5, 7, 5, 4, 7, 4, ...
## $ i_found_partner__sincere <dbl> 9, 8, 8, 6, 6, 9, 6, 9, ...
## $ i_found_partner__intelligent <dbl> 7, 7, 9, 8, 7, 7, 7, 7, ...
## $ i_found_partner__fun <dbl> 7, 8, 8, 7, 7, 4, 4, 6, ...
## $ i_found_partner__ambitious <dbl> 6, 5, 5, 6, 6, 6, 6, 5, ...
## $ i_found_partner__interests <dbl> 5, 6, 7, 8, 6, 4, 7, 6, ...
## $ partner_found_me__attractive <dbl> 6, 7, 10, 7, 8, 7, 3, 6,...
## $ partner_found_me__sincere <dbl> 8, 8, 10, 8, 7, 7, 6, 7,...
## $ partner_found_me__intelligent <dbl> 8, 10, 10, 9, 9, 8, 7, 5...
## $ partner_found_me__fun <dbl> 8, 7, 10, 8, 6, 8, 5, 6,...
## $ partner_found_me__ambitious <dbl> 8, 7, 10, 9, 9, 7, 8, 8,...
## $ partner_found_me__interests <dbl> 6, 5, 10, 8, 7, 7, 7, 6,...
## $ probability_i_find_partner_liked_me <dbl> 6, 5, NA, 6, 6, 5, 5, 7,...
## $ met_before <lgl> FALSE, TRUE, TRUE, FALSE...
## $ opinion_duration_of_date <fct> Too much, Too much, Too ...
## $ opinion_num_dates <fct> Too few, Too few, Too fe...
## $ competitors_look_for__attractive <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__sincere <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__intelligent <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__fun <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__ambitious <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__shared_interests <lgl> NA, NA, NA, NA, NA, NA, ...
## $ you_look_for__attractive <dbl> 15, 15, 15, 15, 15, 15, ...
## $ you_look_for__sincere <dbl> 20, 20, 20, 20, 20, 20, ...
## $ you_look_for__intelligent <dbl> 20, 20, 20, 20, 20, 20, ...
## $ you_look_for__fun <dbl> 15, 15, 15, 15, 15, 15, ...
## $ you_look_for__ambitious <dbl> 15, 15, 15, 15, 15, 15, ...
## $ you_look_for__shared_interests <dbl> 15, 15, 15, 15, 15, 15, ...
## $ others_perceive_you__attractive <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__sincere <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__intelligent <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__fun <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__ambitious <lgl> NA, NA, NA, NA, NA, NA, ...
## $ you_perceive_yourself__attractive <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ you_perceive_yourself__sincere <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__fun <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__intelligent <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__ambitious <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ actual_importance__attractive <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__sincere <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__intelligent <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__fun <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__ambitious <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__shared_interests <lgl> NA, NA, NA, NA, NA, NA, ...
## $ race <fct> Asian, Asian, Asian, Asi...
## $ goal <fct> Meet new people, Meet ne...
Some questions were asked inconsistently across the days of the study.
On some days, respondents were given a budget of x points to distribute across the attributes of the same kind; on others, each attribute was given its own budget.
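In other words, each respondent's stated preferences must be rescaled to proportions of that respondent's own total, so rows collected under different budgets become comparable. A tiny worked example (the numbers are illustrative):

prefs <- c(35, 20, 20, 20, 0, 5)   # one respondent's six preference points
prefs / sum(prefs)                 # rescaled to proportions of the total
## [1] 0.35 0.20 0.20 0.20 0.00 0.05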
normaliza_no_prefixo <- function(df, prefixo) {
  df %>%
    rowwise() %>%
    # Sum, within each row, every column sharing the prefix
    mutate(
      "{prefixo}_soma" :=
        sum(c_across(starts_with(prefixo)), na.rm = TRUE)
    ) %>%
    # Divide each prefixed column by the row total, turning points
    # into proportions of the respondent's own budget
    mutate(
      across(
        .cols = starts_with(prefixo),
        .fns = ~ .x / .data[[str_glue("{prefixo}_soma")]]
      )
    ) %>%
    # Drop the helper column
    select(-contains(str_glue("{prefixo}_soma")))
}
dados_speed_date_normalizada <- dados_speed_date_fatores %>%
  normaliza_no_prefixo("partner_stated_pref_time0_") %>%
  normaliza_no_prefixo("you_look_for__") %>%
  normaliza_no_prefixo("you_think_opposite_sex_look_for__") %>%
  ungroup()
glimpse(dados_speed_date_normalizada)
## Rows: 8,378
## Columns: 79
## $ match <lgl> FALSE, FALSE, TRUE, TRUE...
## $ unique_id_number <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ id_within_wave <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ sex <fct> Mulher, Mulher, Mulher, ...
## $ subject_within_gender <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ choice <fct> limited, limited, limite...
## $ n_people_met_in_wave <dbl> 10, 10, 10, 10, 10, 10, ...
## $ position_meeting <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ position_started <lgl> NA, NA, NA, NA, NA, NA, ...
## $ order_meeting <dbl> 4, 3, 10, 5, 7, 6, 1, 2,...
## $ partner_id_within_wave <dbl> 1, 2, 3, 4, 5, 6, 7, 8, ...
## $ partner_unique_id_number <dbl> 11, 12, 13, 14, 15, 16, ...
## $ interests_correlation <dbl> 0.14, 0.54, 0.16, 0.61, ...
## $ same_race <lgl> FALSE, FALSE, TRUE, FALS...
## $ my_age <dbl> 21, 21, 21, 21, 21, 21, ...
## $ partner_age <dbl> 27, 22, 22, 23, 24, 25, ...
## $ partner_race <fct> White, White, Asian, Whi...
## $ partner_stated_pref_time0_attractive <dbl> 0.3500000, 0.6000000, 0....
## $ partner_stated_pref_time0_sincere <dbl> 0.2000000, 0.0000000, 0....
## $ partner_stated_pref_time0_intelligent <dbl> 0.2000000, 0.0000000, 0....
## $ partner_stated_pref_time0_fun <dbl> 0.2000000, 0.4000000, 0....
## $ partner_stated_pref_time0_ambitious <dbl> 0.0000000, 0.0000000, 0....
## $ partner_stated_pref_time0_shared_interests <dbl> 0.0500000, 0.0000000, 0....
## $ importance_same_race <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ importance_same_religion <dbl> 4, 4, 4, 4, 4, 4, 4, 4, ...
## $ income_zipcode <dbl> 69487, 69487, 69487, 694...
## $ frequency_date <ord> Never, Never, Never, Nev...
## $ frequency_go_out <ord> Several a week, Several ...
## $ career_macro <fct> NA, NA, NA, NA, NA, NA, ...
## $ happy_expec <dbl> 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ n_expect_like_you <dbl> 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ partner_liked_me <lgl> FALSE, FALSE, TRUE, TRUE...
## $ i_liked_partner <lgl> TRUE, TRUE, TRUE, TRUE, ...
## $ i_found_partner__attractive <dbl> 6, 7, 5, 7, 5, 4, 7, 4, ...
## $ i_found_partner__sincere <dbl> 9, 8, 8, 6, 6, 9, 6, 9, ...
## $ i_found_partner__intelligent <dbl> 7, 7, 9, 8, 7, 7, 7, 7, ...
## $ i_found_partner__fun <dbl> 7, 8, 8, 7, 7, 4, 4, 6, ...
## $ i_found_partner__ambitious <dbl> 6, 5, 5, 6, 6, 6, 6, 5, ...
## $ i_found_partner__interests <dbl> 5, 6, 7, 8, 6, 4, 7, 6, ...
## $ partner_found_me__attractive <dbl> 6, 7, 10, 7, 8, 7, 3, 6,...
## $ partner_found_me__sincere <dbl> 8, 8, 10, 8, 7, 7, 6, 7,...
## $ partner_found_me__intelligent <dbl> 8, 10, 10, 9, 9, 8, 7, 5...
## $ partner_found_me__fun <dbl> 8, 7, 10, 8, 6, 8, 5, 6,...
## $ partner_found_me__ambitious <dbl> 8, 7, 10, 9, 9, 7, 8, 8,...
## $ partner_found_me__interests <dbl> 6, 5, 10, 8, 7, 7, 7, 6,...
## $ probability_i_find_partner_liked_me <dbl> 6, 5, NA, 6, 6, 5, 5, 7,...
## $ met_before <lgl> FALSE, TRUE, TRUE, FALSE...
## $ opinion_duration_of_date <fct> Too much, Too much, Too ...
## $ opinion_num_dates <fct> Too few, Too few, Too fe...
## $ competitors_look_for__attractive <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__sincere <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__intelligent <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__fun <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__ambitious <lgl> NA, NA, NA, NA, NA, NA, ...
## $ competitors_look_for__shared_interests <lgl> NA, NA, NA, NA, NA, NA, ...
## $ you_look_for__attractive <dbl> 0.15, 0.15, 0.15, 0.15, ...
## $ you_look_for__sincere <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__intelligent <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__fun <dbl> 0.15, 0.15, 0.15, 0.15, ...
## $ you_look_for__ambitious <dbl> 0.15, 0.15, 0.15, 0.15, ...
## $ you_look_for__shared_interests <dbl> 0.15, 0.15, 0.15, 0.15, ...
## $ others_perceive_you__attractive <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__sincere <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__intelligent <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__fun <lgl> NA, NA, NA, NA, NA, NA, ...
## $ others_perceive_you__ambitious <lgl> NA, NA, NA, NA, NA, NA, ...
## $ you_perceive_yourself__attractive <dbl> 6, 6, 6, 6, 6, 6, 6, 6, ...
## $ you_perceive_yourself__sincere <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__fun <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__intelligent <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ you_perceive_yourself__ambitious <dbl> 7, 7, 7, 7, 7, 7, 7, 7, ...
## $ actual_importance__attractive <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__sincere <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__intelligent <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__fun <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__ambitious <lgl> NA, NA, NA, NA, NA, NA, ...
## $ actual_importance__shared_interests <lgl> NA, NA, NA, NA, NA, NA, ...
## $ race <fct> Asian, Asian, Asian, Asi...
## $ goal <fct> Meet new people, Meet ne...
The skimr library, with its skim() function, offers a good way to see a summary of the data's characteristics.
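A call like the one below presumably produced the summary that follows (a sketch; the original chunk is not shown):
skim(dados_speed_date_normalizada)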
| Name | dados_speed_date_normaliz… |
| Number of rows | 8378 |
| Number of columns | 79 |
| _______________________ | |
| Column type frequency: | |
| factor | 10 |
| logical | 23 |
| numeric | 46 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| sex | 0 | 1.00 | FALSE | 2 | Hom: 4194, Mul: 4184 |
| choice | 0 | 1.00 | FALSE | 2 | ext: 6944, lim: 1434 |
| partner_race | 73 | 0.99 | FALSE | 5 | Whi: 4722, Asi: 1978, Lat: 664, Oth: 521 |
| frequency_date | 97 | 0.99 | TRUE | 7 | Sev: 2094, Twi: 2040, Onc: 1528, Nev: 1434 |
| frequency_go_out | 2778 | 0.67 | TRUE | 2 | Twi: 2990, Sev: 2610, Onc: 0, Twi: 0 |
| career_macro | 138 | 0.98 | FALSE | 17 | Aca: 2320, Ban: 2170, Cre: 724, Law: 675 |
| opinion_duration_of_date | 915 | 0.89 | FALSE | 3 | Too: 4227, Jus: 3059, Too: 177 |
| opinion_num_dates | 4107 | 0.51 | FALSE | 2 | Too: 3622, Too: 649 |
| race | 63 | 0.99 | FALSE | 5 | Whi: 4727, Asi: 1982, Lat: 664, Oth: 522 |
| goal | 79 | 0.99 | FALSE | 6 | Fun: 3426, Mee: 3012, Dat: 631, To : 510 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| match | 0 | 1.00 | 0.16 | FAL: 6998, TRU: 1380 |
| position_started | 7974 | 0.05 | 1.00 | TRU: 404 |
| same_race | 0 | 1.00 | 0.40 | FAL: 5062, TRU: 3316 |
| partner_liked_me | 0 | 1.00 | 0.42 | FAL: 4863, TRU: 3515 |
| i_liked_partner | 0 | 1.00 | 0.42 | FAL: 4860, TRU: 3518 |
| met_before | 375 | 0.96 | 0.04 | FAL: 7652, TRU: 351 |
| competitors_look_for__attractive | 8378 | 0.00 | NaN | : |
| competitors_look_for__sincere | 7997 | 0.05 | 0.05 | FAL: 363, TRU: 18 |
| competitors_look_for__intelligent | 8204 | 0.02 | 0.28 | FAL: 125, TRU: 49 |
| competitors_look_for__fun | 8319 | 0.01 | 0.31 | FAL: 41, TRU: 18 |
| competitors_look_for__ambitious | 7693 | 0.08 | 0.18 | FAL: 563, TRU: 122 |
| competitors_look_for__shared_interests | 8059 | 0.04 | 0.15 | FAL: 271, TRU: 48 |
| others_perceive_you__attractive | 8378 | 0.00 | NaN | : |
| others_perceive_you__sincere | 8368 | 0.00 | 1.00 | TRU: 10 |
| others_perceive_you__intelligent | 8378 | 0.00 | NaN | : |
| others_perceive_you__fun | 8378 | 0.00 | NaN | : |
| others_perceive_you__ambitious | 8363 | 0.00 | 1.00 | TRU: 15 |
| actual_importance__attractive | 8378 | 0.00 | NaN | : |
| actual_importance__sincere | 8205 | 0.02 | 0.00 | FAL: 173 |
| actual_importance__intelligent | 8297 | 0.01 | 0.00 | FAL: 81 |
| actual_importance__fun | 8344 | 0.00 | 0.00 | FAL: 34 |
| actual_importance__ambitious | 7842 | 0.06 | 0.00 | FAL: 536 |
| actual_importance__shared_interests | 8165 | 0.03 | 0.00 | FAL: 213 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| unique_id_number | 0 | 1.00 | 283.68 | 158.58 | 1.00 | 154.00 | 281.00 | 407.00 | 552.00 | ▇▇▇▇▇ |
| id_within_wave | 1 | 1.00 | 8.96 | 5.49 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| subject_within_gender | 0 | 1.00 | 17.33 | 10.94 | 1.00 | 8.00 | 16.00 | 26.00 | 44.00 | ▇▇▅▅▂ |
| n_people_met_in_wave | 0 | 1.00 | 16.87 | 4.36 | 5.00 | 14.00 | 18.00 | 20.00 | 22.00 | ▁▃▂▅▇ |
| position_meeting | 0 | 1.00 | 9.04 | 5.51 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| order_meeting | 0 | 1.00 | 8.93 | 5.48 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| partnet_id_within_wave | 0 | 1.00 | 8.96 | 5.49 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| partner_unique_id_number | 10 | 1.00 | 283.86 | 158.58 | 1.00 | 154.00 | 281.00 | 408.00 | 552.00 | ▇▇▇▇▇ |
| interests_correlation | 158 | 0.98 | 0.20 | 0.30 | -0.83 | -0.02 | 0.21 | 0.43 | 0.91 | ▁▃▇▇▂ |
| my_age | 95 | 0.99 | 26.36 | 3.57 | 18.00 | 24.00 | 26.00 | 28.00 | 55.00 | ▇▇▁▁▁ |
| partner_age | 104 | 0.99 | 26.36 | 3.56 | 18.00 | 24.00 | 26.00 | 28.00 | 55.00 | ▇▇▁▁▁ |
| partner_stated_pref_time0_attractive | 89 | 0.99 | 0.22 | 0.13 | 0.00 | 0.15 | 0.20 | 0.25 | 1.00 | ▇▃▁▁▁ |
| partner_stated_pref_time0_sincere | 89 | 0.99 | 0.17 | 0.07 | 0.00 | 0.15 | 0.18 | 0.20 | 0.60 | ▃▇▂▁▁ |
| partner_stated_pref_time0_intelligent | 89 | 0.99 | 0.20 | 0.07 | 0.00 | 0.17 | 0.20 | 0.23 | 0.50 | ▂▇▃▁▁ |
| partner_stated_pref_time0_fun | 98 | 0.99 | 0.17 | 0.06 | 0.00 | 0.15 | 0.18 | 0.20 | 0.50 | ▂▇▂▁▁ |
| partner_stated_pref_time0_ambitious | 107 | 0.99 | 0.11 | 0.06 | 0.00 | 0.05 | 0.10 | 0.15 | 0.36 | ▇▇▇▁▁ |
| partner_stated_pref_time0_shared_interests | 129 | 0.98 | 0.12 | 0.06 | 0.00 | 0.10 | 0.11 | 0.16 | 0.30 | ▆▇▇▃▁ |
| importance_same_race | 79 | 0.99 | 3.78 | 2.85 | 0.00 | 1.00 | 3.00 | 6.00 | 10.00 | ▇▃▂▂▂ |
| importance_same_religion | 79 | 0.99 | 3.65 | 2.81 | 1.00 | 1.00 | 3.00 | 6.00 | 10.00 | ▇▃▃▂▁ |
| income_zipcode | 4099 | 0.51 | 44887.61 | 17206.92 | 8607.00 | 31516.00 | 43185.00 | 54303.00 | 109031.00 | ▃▇▅▂▁ |
| happy_expec | 101 | 0.99 | 5.53 | 1.73 | 1.00 | 5.00 | 6.00 | 7.00 | 10.00 | ▁▃▇▃▁ |
| n_expect_like_you | 6578 | 0.21 | 5.57 | 4.76 | 0.00 | 2.00 | 4.00 | 8.00 | 20.00 | ▇▃▂▁▁ |
| i_found_partner__attractive | 202 | 0.98 | 6.19 | 1.95 | 0.00 | 5.00 | 6.00 | 8.00 | 10.00 | ▁▃▇▇▂ |
| i_found_partner__sincere | 277 | 0.97 | 7.18 | 1.74 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| i_found_partner__intelligent | 296 | 0.96 | 7.37 | 1.55 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| i_found_partner__fun | 350 | 0.96 | 6.40 | 1.95 | 0.00 | 5.00 | 7.00 | 8.00 | 10.00 | ▁▂▇▇▂ |
| i_found_partner__ambitious | 712 | 0.92 | 6.78 | 1.79 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▆▇▃ |
| i_found_partner__interests | 1067 | 0.87 | 5.47 | 2.16 | 0.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▂▅▇▆▂ |
| partner_found_me__attractive | 212 | 0.97 | 6.19 | 1.95 | 0.00 | 5.00 | 6.00 | 8.00 | 10.50 | ▁▃▇▇▂ |
| partner_found_me__sincere | 287 | 0.97 | 7.18 | 1.74 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| partner_found_me__intelligent | 306 | 0.96 | 7.37 | 1.55 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| partner_found_me__fun | 360 | 0.96 | 6.40 | 1.95 | 0.00 | 5.00 | 7.00 | 8.00 | 11.00 | ▁▂▇▇▂ |
| partner_found_me__ambitious | 722 | 0.91 | 6.78 | 1.79 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▆▇▃ |
| partner_found_me__interests | 1076 | 0.87 | 5.47 | 2.16 | 0.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▂▅▇▆▂ |
| probability_i_find_partner_liked_me | 309 | 0.96 | 5.21 | 2.13 | 0.00 | 4.00 | 5.00 | 7.00 | 10.00 | ▂▅▇▅▁ |
| you_look_for__attractive | 79 | 0.99 | 0.23 | 0.13 | 0.00 | 0.15 | 0.20 | 0.25 | 1.00 | ▇▃▁▁▁ |
| you_look_for__sincere | 79 | 0.99 | 0.17 | 0.07 | 0.00 | 0.15 | 0.18 | 0.20 | 0.60 | ▃▇▂▁▁ |
| you_look_for__intelligent | 79 | 0.99 | 0.20 | 0.07 | 0.00 | 0.17 | 0.20 | 0.23 | 0.50 | ▂▇▃▁▁ |
| you_look_for__fun | 89 | 0.99 | 0.17 | 0.06 | 0.00 | 0.15 | 0.18 | 0.20 | 0.50 | ▂▇▂▁▁ |
| you_look_for__ambitious | 99 | 0.99 | 0.11 | 0.06 | 0.00 | 0.05 | 0.10 | 0.15 | 0.36 | ▇▇▇▁▁ |
| you_look_for__shared_interests | 121 | 0.99 | 0.12 | 0.06 | 0.00 | 0.10 | 0.11 | 0.16 | 0.30 | ▆▇▇▃▁ |
| you_perceive_yourself__attractive | 105 | 0.99 | 7.08 | 1.40 | 2.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▂▇▂ |
| you_perceive_yourself__sincere | 105 | 0.99 | 8.29 | 1.41 | 2.00 | 8.00 | 8.00 | 9.00 | 10.00 | ▁▁▁▆▇ |
| you_perceive_yourself__fun | 105 | 0.99 | 7.70 | 1.56 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▇▆ |
| you_perceive_yourself__intelligent | 105 | 0.99 | 8.40 | 1.08 | 3.00 | 8.00 | 8.00 | 9.00 | 10.00 | ▁▁▃▆▇ |
| you_perceive_yourself__ambitious | 105 | 0.99 | 7.58 | 1.78 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▂▂▇▆ |
The skim() function returns a tibble, which can be used to extract statistics about the dataset.
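The glimpse that follows was presumably produced by piping the skim result (a sketch; the original chunk is not shown):
skim(dados_speed_date_normalizada) %>%
  glimpse()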
## Rows: 79
## Columns: 17
## $ skim_type <chr> "factor", "factor", "factor", "factor", "factor",...
## $ skim_variable <chr> "sex", "choice", "partner_race", "frequency_date"...
## $ n_missing <int> 0, 0, 73, 97, 2778, 138, 915, 4107, 63, 79, 0, 79...
## $ complete_rate <dbl> 1.000000000, 1.000000000, 0.991286703, 0.98842205...
## $ factor.ordered <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FA...
## $ factor.n_unique <int> 2, 2, 5, 7, 2, 17, 3, 2, 5, 6, NA, NA, NA, NA, NA...
## $ factor.top_counts <chr> "Hom: 4194, Mul: 4184", "ext: 6944, lim: 1434", "...
## $ logical.mean <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.1647171...
## $ logical.count <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "FAL: 699...
## $ numeric.mean <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.sd <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p0 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p25 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p50 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p75 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.p100 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ numeric.hist <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
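For instance, a short sketch (assuming the dplyr verbs are loaded) that ranks the variables by completion rate:
skim(dados_speed_date_normalizada) %>%
  as_tibble() %>%
  select(skim_variable, n_missing, complete_rate) %>%
  arrange(complete_rate)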
We can see that many fields are almost complete, while some fields are much less filled in.
In general, these are fields that were filled in during a survey conducted weeks after the event.
We therefore remove the data with little representation.
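That removal step is not shown here; a minimal sketch consistent with the column counts above would simply drop the sparsely filled fields (the exact selection is an assumption):
# Assumed step (not in the original): drop the fields skim() flagged as mostly missing
dados_com_representacao <- dados_speed_date_normalizada %>%
  select(
    -starts_with("competitors_look_for__"),
    -starts_with("others_perceive_you__"),
    -starts_with("actual_importance__"),
    -position_started,
    -income_zipcode,
    -n_expect_like_you,
    -frequency_go_out,
    -opinion_num_dates
  )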
We want some of the partner's impressions in our dataset, so we build the final summary to start playing with the data.
dados_speed_date_partner_side <- dados_speed_date_normalizada %>%
  select(
    unique_id_number,
    partner_unique_id_number,
    probability_partner_find_i_liked_partner = probability_i_find_partner_liked_me,
    partner_career_macro = career_macro,
    starts_with("you_perceive_yourself__")
  ) %>%
  rename_with(
    .cols = starts_with("you_perceive_yourself__"),
    .fn = ~str_replace(.x, "you_perceive_yourself__", "partner_perceives_himself__")
  )
dados_finais <- dados_com_representacao %>%
  # Swap the id columns in the join so each row receives its partner's answers
  left_join(
    dados_speed_date_partner_side,
    by = c("unique_id_number" = "partner_unique_id_number", "partner_unique_id_number" = "unique_id_number")
  ) %>%
  # Keep only rows with no missing values in any column
  filter(
    across(
      .cols = everything(),
      .fns = ~!is.na(.x)
    )
  ) %>%
  # Recode logical columns as 0/1 numerics
  mutate(
    across(
      .cols = where(is.logical),
      .fns = as.numeric
    )
  )
resumo_com_representacao <- skim(dados_finais)
resumo_com_representacao
| Name | dados_finais |
| Number of rows | 4885 |
| Number of columns | 64 |
| _______________________ | |
| Column type frequency: | |
| factor | 9 |
| numeric | 55 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| sex | 0 | 1 | FALSE | 2 | Hom: 2456, Mul: 2429 |
| choice | 0 | 1 | FALSE | 2 | ext: 4099, lim: 786 |
| partner_race | 0 | 1 | FALSE | 5 | Whi: 2683, Asi: 1210, Lat: 386, Oth: 357 |
| frequency_date | 0 | 1 | TRUE | 7 | Twi: 1301, Sev: 1242, Onc: 910, Nev: 793 |
| career_macro | 0 | 1 | FALSE | 17 | Aca: 1472, Ban: 1236, Cre: 419, Law: 401 |
| opinion_duration_of_date | 0 | 1 | FALSE | 3 | Too: 2786, Jus: 1996, Too: 103 |
| race | 0 | 1 | FALSE | 5 | Whi: 2687, Asi: 1190, Lat: 406, Oth: 372 |
| goal | 0 | 1 | FALSE | 6 | Fun: 2054, Mee: 1793, Dat: 372, To : 273 |
| partner_career_macro | 0 | 1 | FALSE | 17 | Aca: 1396, Ban: 1292, Cre: 408, Law: 379 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| match | 0 | 1 | 0.18 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| same_race | 0 | 1 | 0.39 | 0.49 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
| partner_liked_me | 0 | 1 | 0.44 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| i_liked_partner | 0 | 1 | 0.45 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| met_before | 0 | 1 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| unique_id_number | 0 | 1 | 283.07 | 156.88 | 4.00 | 160.00 | 274.00 | 411.00 | 552.00 | ▇▆▇▆▇ |
| id_within_wave | 0 | 1 | 9.10 | 5.57 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| subject_within_gender | 0 | 1 | 17.61 | 11.09 | 1.00 | 8.00 | 16.00 | 26.00 | 44.00 | ▇▇▅▅▂ |
| n_people_met_in_wave | 0 | 1 | 17.01 | 4.33 | 5.00 | 14.00 | 18.00 | 20.00 | 22.00 | ▁▂▂▅▇ |
| position_meeting | 0 | 1 | 9.13 | 5.50 | 1.00 | 4.00 | 9.00 | 13.00 | 22.00 | ▇▆▅▅▂ |
| order_meeting | 0 | 1 | 8.87 | 5.46 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| partnet_id_within_wave | 0 | 1 | 9.12 | 5.51 | 1.00 | 5.00 | 9.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| partner_unique_id_number | 0 | 1 | 282.91 | 156.98 | 4.00 | 158.00 | 274.00 | 411.00 | 552.00 | ▇▆▇▆▇ |
| interests_correlation | 0 | 1 | 0.20 | 0.30 | -0.83 | -0.02 | 0.22 | 0.43 | 0.91 | ▁▃▇▇▂ |
| my_age | 0 | 1 | 26.16 | 3.44 | 18.00 | 24.00 | 26.00 | 28.00 | 55.00 | ▇▇▁▁▁ |
| partner_age | 0 | 1 | 26.19 | 3.41 | 18.00 | 24.00 | 26.00 | 28.00 | 55.00 | ▇▇▁▁▁ |
| partner_stated_pref_time0_attractive | 0 | 1 | 0.22 | 0.11 | 0.00 | 0.15 | 0.20 | 0.25 | 1.00 | ▇▃▁▁▁ |
| partner_stated_pref_time0_sincere | 0 | 1 | 0.18 | 0.07 | 0.00 | 0.15 | 0.18 | 0.20 | 0.47 | ▂▇▇▁▁ |
| partner_stated_pref_time0_intelligent | 0 | 1 | 0.20 | 0.07 | 0.00 | 0.18 | 0.20 | 0.24 | 0.50 | ▁▇▃▁▁ |
| partner_stated_pref_time0_fun | 0 | 1 | 0.17 | 0.06 | 0.00 | 0.15 | 0.18 | 0.20 | 0.50 | ▂▇▂▁▁ |
| partner_stated_pref_time0_ambitious | 0 | 1 | 0.11 | 0.06 | 0.00 | 0.05 | 0.10 | 0.15 | 0.36 | ▆▇▇▁▁ |
| partner_stated_pref_time0_shared_interests | 0 | 1 | 0.12 | 0.06 | 0.00 | 0.10 | 0.11 | 0.16 | 0.30 | ▅▇▇▃▁ |
| importance_same_race | 0 | 1 | 3.83 | 2.83 | 1.00 | 1.00 | 3.00 | 6.00 | 10.00 | ▇▃▃▂▂ |
| importance_same_religion | 0 | 1 | 3.61 | 2.85 | 1.00 | 1.00 | 3.00 | 6.00 | 10.00 | ▇▂▃▂▂ |
| happy_expec | 0 | 1 | 5.49 | 1.78 | 1.00 | 5.00 | 6.00 | 7.00 | 10.00 | ▁▃▇▅▁ |
| i_found_partner__attractive | 0 | 1 | 6.25 | 1.94 | 0.00 | 5.00 | 6.00 | 8.00 | 10.00 | ▁▃▇▇▂ |
| i_found_partner__sincere | 0 | 1 | 7.22 | 1.72 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| i_found_partner__intelligent | 0 | 1 | 7.43 | 1.52 | 0.00 | 7.00 | 8.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| i_found_partner__fun | 0 | 1 | 6.48 | 1.94 | 0.00 | 5.00 | 7.00 | 8.00 | 10.00 | ▁▂▇▇▃ |
| i_found_partner__ambitious | 0 | 1 | 6.82 | 1.79 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▆▇▃ |
| i_found_partner__interests | 0 | 1 | 5.55 | 2.14 | 0.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▂▅▇▆▂ |
| partner_found_me__attractive | 0 | 1 | 6.21 | 1.93 | 0.00 | 5.00 | 6.00 | 8.00 | 10.00 | ▁▃▇▇▂ |
| partner_found_me__sincere | 0 | 1 | 7.17 | 1.74 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| partner_found_me__intelligent | 0 | 1 | 7.39 | 1.53 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| partner_found_me__fun | 0 | 1 | 6.43 | 1.94 | 0.00 | 5.00 | 7.00 | 8.00 | 11.00 | ▁▂▇▇▂ |
| partner_found_me__ambitious | 0 | 1 | 6.76 | 1.79 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▆▇▃ |
| partner_found_me__interests | 0 | 1 | 5.50 | 2.13 | 0.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▂▅▇▆▂ |
| probability_i_find_partner_liked_me | 0 | 1 | 5.33 | 2.12 | 0.00 | 4.00 | 5.00 | 7.00 | 10.00 | ▂▅▇▅▁ |
| you_look_for__attractive | 0 | 1 | 0.22 | 0.11 | 0.00 | 0.15 | 0.20 | 0.25 | 0.90 | ▇▇▁▁▁ |
| you_look_for__sincere | 0 | 1 | 0.18 | 0.07 | 0.00 | 0.15 | 0.18 | 0.20 | 0.47 | ▁▇▇▁▁ |
| you_look_for__intelligent | 0 | 1 | 0.20 | 0.07 | 0.00 | 0.18 | 0.20 | 0.23 | 0.50 | ▁▇▃▁▁ |
| you_look_for__fun | 0 | 1 | 0.17 | 0.06 | 0.00 | 0.15 | 0.18 | 0.20 | 0.50 | ▂▇▂▁▁ |
| you_look_for__ambitious | 0 | 1 | 0.11 | 0.06 | 0.00 | 0.05 | 0.10 | 0.15 | 0.36 | ▆▇▇▁▁ |
| you_look_for__shared_interests | 0 | 1 | 0.12 | 0.06 | 0.00 | 0.10 | 0.11 | 0.16 | 0.30 | ▆▇▇▃▁ |
| you_perceive_yourself__attractive | 0 | 1 | 7.14 | 1.41 | 2.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▂▇▂ |
| you_perceive_yourself__sincere | 0 | 1 | 8.33 | 1.44 | 2.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▁▆▇ |
| you_perceive_yourself__fun | 0 | 1 | 7.77 | 1.58 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▇▆ |
| you_perceive_yourself__intelligent | 0 | 1 | 8.50 | 1.08 | 3.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▂▅▇ |
| you_perceive_yourself__ambitious | 0 | 1 | 7.65 | 1.83 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▂▂▇▇ |
| probability_partner_find_i_liked_partner | 0 | 1 | 5.27 | 2.12 | 0.00 | 4.00 | 5.00 | 7.00 | 10.00 | ▂▅▇▅▁ |
| partner_perceives_himself__attractive | 0 | 1 | 7.11 | 1.40 | 2.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▂▇▂ |
| partner_perceives_himself__sincere | 0 | 1 | 8.33 | 1.44 | 2.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▁▆▇ |
| partner_perceives_himself__fun | 0 | 1 | 7.73 | 1.57 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▇▆ |
| partner_perceives_himself__intelligent | 0 | 1 | 8.46 | 1.08 | 3.00 | 8.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▆▇ |
| partner_perceives_himself__ambitious | 0 | 1 | 7.63 | 1.82 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▂▂▇▇ |
We can check, for example, whether people have an accurate image of their own attractiveness.
escala_sexo <- c(Homem = "darkblue", Mulher = "darkred")
dados_finais %>%
ggplot(
aes(
y = partner_found_me__attractive,
x = you_perceive_yourself__attractive
)
) +
  geom_boxplot(
    aes(
      group = you_perceive_yourself__attractive,
      color = sex,
      fill = sex
    ),
    alpha = 0.3,
    show.legend = FALSE
  ) +
scale_color_manual(
values = escala_sexo
) +
scale_fill_manual(
values = escala_sexo
) +
stat_smooth(
method = "loess",
formula = y ~ x,
show.legend = FALSE,
se = FALSE,
aes(
color = sex
)
) +
geom_function(
fun = identity
) +
facet_wrap(
~sex
) +
scale_x_continuous(
breaks = 0:10
) +
scale_y_continuous(
breaks = 0:10
) +
labs(
x = "Me acho bonito",
y = "Parceiro me acha bonito"
) +
  theme_minimal()
How is my rating of the partner on each attribute correlated with whether I liked the partner?
dados_grafico_partner_liked <- dados_finais %>%
select(
i_liked_partner,
starts_with("i_found_partner__"),
sex
) %>%
pivot_longer(
cols = -c(i_liked_partner, sex),
names_to = "i_found_partner",
names_pattern = "i_found_partner__(.*)",
values_to = "degree"
) %>%
mutate(
degree = round(degree)
) %>%
group_by(
degree,
i_found_partner,
sex
) %>%
summarise(
i_liked_partner = mean(i_liked_partner),
n = n()
) %>%
filter(
n > 100
)
ggplot(dados_grafico_partner_liked) +
geom_line(
aes(
x = degree,
y = i_liked_partner,
color = sex,
),
size = 1.2
) +
geom_point(
aes(
x = degree,
y = i_liked_partner,
color = sex,
size = n
)
) +
facet_wrap(
~i_found_partner
) +
theme_minimal() +
theme(
legend.position = "top"
) +
scale_x_continuous(
breaks = 1:10
) +
scale_y_continuous(
limits = c(0,1),
breaks = seq(0, to = 1, by = .2),
labels = percent_format(accuracy = 1)
) +
scale_color_manual(
values = escala_sexo
) +
labs(
x = "Gostei deste atributo no parceiro",
y = "Gostei do parceiro. Quero ele(a)"
  )
In the previous analysis we computed the conditional mean variable by variable, but we can compute the mean conditional on all the variables at once.
The way to do that is to run a multiple ordinary least squares linear regression.
parsnip is the successor to the core of caret.
It is used to offer a generic interface to several types of statistical learning models.
Here we choose a linear model and use R's lm function as the engine.
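A specification consistent with the output below and with the lm_mod object fit later (a sketch; the original chunk is not shown):
lm_mod <-
  linear_reg() %>%
  set_engine("lm")
lm_mod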
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
Now we actually fit the model.
Note that the model is fit with interactions between the attributes and the "sex" dummy.
The broom library offers methods, such as tidy(), to extract estimates and statistics from the objects returned by parsnip training functions like fit() (yardstick, in turn, focuses on performance metrics).
lm_fit <-
lm_mod %>%
fit( i_liked_partner ~
sex +
i_found_partner__attractive * sex +
i_found_partner__ambitious * sex +
i_found_partner__fun * sex +
i_found_partner__intelligent * sex +
i_found_partner__interests * sex +
i_found_partner__sincere * sex ,
data = dados_finais)
tidy(lm_fit)
## # A tibble: 14 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.395 0.0442 -8.92 6.56e-19
## 2 sexHomem 0.0674 0.0643 1.05 2.95e- 1
## 3 i_found_partner__attractive 0.0673 0.00547 12.3 2.77e-34
## 4 i_found_partner__ambitious -0.0205 0.00633 -3.25 1.18e- 3
## 5 i_found_partner__fun 0.0365 0.00629 5.81 6.71e- 9
## 6 i_found_partner__intelligent 0.0189 0.00846 2.23 2.59e- 2
## 7 i_found_partner__interests 0.0479 0.00514 9.31 1.98e-20
## 8 i_found_partner__sincere -0.0164 0.00648 -2.53 1.13e- 2
## 9 sexHomem:i_found_partner__attractive 0.0472 0.00786 6.00 2.07e- 9
## 10 sexHomem:i_found_partner__ambitious -0.00375 0.00918 -0.409 6.83e- 1
## 11 sexHomem:i_found_partner__fun -0.000680 0.00921 -0.0738 9.41e- 1
## 12 sexHomem:i_found_partner__intelligent -0.0312 0.0122 -2.56 1.05e- 2
## 13 sexHomem:i_found_partner__interests -0.00185 0.00732 -0.252 8.01e- 1
## 14 sexHomem:i_found_partner__sincere -0.00353 0.00973 -0.363 7.17e- 1
This is easier to see as a graph.
dwplot(tidy(lm_fit), dot_args = list(size = 2, color = "darkblue"),
whisker_args = list(color = "darkblue"),
vline = geom_vline(xintercept = 0, colour = "darkblue", linetype = 2)) +
  theme_minimal()
We can use the predict() function to generate estimates of y given new values of x.
medias_i_found <- dados_finais %>%
select(
starts_with("i_found_partner__"),
sex
) %>%
pivot_longer(
cols = -c(sex),
names_to = "i_found_partner",
names_pattern = "i_found_partner__(.*)",
values_to = "degree"
) %>%
mutate(
degree = as.numeric(degree)
) %>%
group_by(
sex,
i_found_partner
) %>%
summarise(
p10 = quantile(degree, probs = 0.1, na.rm = TRUE),
p90 = quantile(degree, probs = 0.9, na.rm = TRUE),
p25 = quantile(degree, probs = 0.25, na.rm = TRUE),
p75 = quantile(degree, probs = 0.75, na.rm = TRUE),
p33 = quantile(degree, probs = 0.33, na.rm = TRUE),
p67 = quantile(degree, probs = 0.67, na.rm = TRUE),
mean = mean(degree, na.rm = TRUE)
) %>%
pivot_wider(
names_from = i_found_partner,
values_from = c(mean, p10, p90, p25, p75, p33, p67)
)
med_h <- medias_i_found %>%
filter(
sex == "Homem"
)
med_m <- medias_i_found %>%
filter(
sex == "Mulher"
)
pontos_novos <-
tribble(
~attractive, ~ambitious, ~fun, ~intelligent, ~interests, ~sincere, ~sex, ~nome,
med_h$mean_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "Média",
med_h$p10_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P10",
med_h$p25_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P25",
med_h$p90_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P90",
med_h$p75_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P75",
med_h$p33_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P33",
med_h$p67_attractive, med_h$mean_ambitious, med_h$mean_fun, med_h$mean_intelligent, med_h$mean_interests, med_h$mean_sincere, "Homem", "P67",
med_m$mean_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "Média",
med_m$p10_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P10",
med_m$p90_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P90",
med_m$p25_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P25",
med_m$p75_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P75",
med_m$p33_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P33",
med_m$p67_attractive, med_m$mean_ambitious, med_m$mean_fun, med_m$mean_intelligent, med_m$mean_interests, med_m$mean_sincere, "Mulher", "P67"
) %>%
rename_with(
.cols = -c(sex, nome),
.fn = ~str_glue("i_found_partner__{.x}")
)
conf_int_pred <- predict(lm_fit,
new_data = pontos_novos,
type = "conf_int")
mean_pred <- predict(lm_fit,
new_data = pontos_novos
)
dados_pred <- pontos_novos %>%
bind_cols(
conf_int_pred
) %>%
bind_cols(
mean_pred
) %>%
view()
ggplot(dados_pred, aes(x = i_found_partner__attractive)) +
geom_point(aes(y = .pred, color = sex)) +
geom_errorbar(aes(ymin = .pred_lower,
ymax = .pred_upper, color = sex),
width = .2) +
labs(y = "Prob. I like partner")+
# geom_mark_circle(
# aes(
# y = .pred,
# label = nome,
# group = interaction(sex, nome),
# color = sex,
# fill = sex
# ),
# label.fontsize = 7,
# con.cap = 1,
# expand = 0.001,
# label.buffer = unit(1, 'mm'),
# show.legend = FALSE
# ) +
theme_minimal() +
theme(
legend.position = "top"
) +
geom_line(
aes(
color = sex,
y = .pred
)
) +
scale_color_manual(
values = escala_sexo
) +
scale_x_continuous(
breaks = 1:10
) +
scale_y_continuous(
breaks = seq(0, to = 1, by= 0.2),
limits = c(0,1),
    labels = percent_format(accuracy = 1)
  )
Now let's leave the linear model and fit a neural network.
dados_finais_nao_nulos_sex_numerico <- dados_finais %>%
mutate(
sex = if_else(sex == "Homem", 1, 0) ,
i_liked_partner = as.numeric(i_liked_partner),
) %>%
filter(
across(
.cols = everything(),
.fns = ~!is.na(.x)
)
)
pontos_novos_rand_for <- pontos_novos %>%
mutate(
sex = if_else(sex == "Homem", 1, 0)
)
set.seed(192)
modelo_nnet <- mlp(mode = "regression", hidden_units = 10 ) %>%
set_engine("nnet")
modelo_nnet
## Single Layer Neural Network Specification (regression)
##
## Main Arguments:
## hidden_units = 10
##
## Computational engine: nnet
fit_nnet <- modelo_nnet %>% fit( i_liked_partner ~
i_found_partner__attractive +
i_found_partner__ambitious +
i_found_partner__fun +
i_found_partner__intelligent +
i_found_partner__interests +
i_found_partner__sincere +
sex,
data = dados_finais_nao_nulos_sex_numerico)
fit_nnet
## parsnip model object
##
## Fit time: 940ms
## a 7-10-1 network with 91 weights
## inputs: i_found_partner__attractive i_found_partner__ambitious i_found_partner__fun i_found_partner__intelligent i_found_partner__interests i_found_partner__sincere sex
## output(s): i_liked_partner
## options were - linear output units
With a neural network, the relationships don't need to be linear. That's the case here.
mean_pred <- predict(fit_nnet,
new_data = pontos_novos_rand_for
)
dados_pred_nnet <- pontos_novos %>%
bind_cols(
mean_pred
)
ggplot(dados_pred_nnet, aes(x = i_found_partner__attractive)) +
geom_point(aes(y = .pred, color = sex)) +
labs(y = "urchin size")+
geom_mark_circle(
aes(
y = .pred,
label = nome,
group = interaction(nome, sex),
color = sex,
fill = sex
),
label.fontsize = 8,
con.cap = 1,
expand = 0.001,
label.buffer = unit(3.5, 'mm'),
show.legend = FALSE
) +
theme_minimal() +
theme(
legend.position = "top"
) +
geom_line(
aes(
color = sex,
y = .pred
)
) +
scale_color_manual(
values = escala_sexo
) +
scale_y_continuous(
breaks = seq(0, to = 1, by= 0.2),
limits = c(0,1)
) +
labs(y = "Prob. I like partner")É importante fazer uma sessão de exploração, que pode ser muito mais detalhada do que a que fizemos.
The exploration session helps us run some sanity checks on the data and extract insights that may or may not feed the feature engineering process, which can help the model reach better results.
Most of the time, the feature engineering process is where we can improve the model we are going to use the most.
This dependence is smaller when we use very complex deep learning models, but those require a colossal amount of data.
Everything we do during model selection, as we have seen, must be done on the training data (which will also serve as validation).
After choosing ONE model, we will evaluate it on the test data.
Source: Feature Engineering and Selection: A Practical Approach for Predictive Models (Kuhn and Johnson)
The rsample library offers the infrastructure needed to draw samples from the available data.
We use it here to set aside the test data.
It will be used again to create the samples used in cross-validation.
set.seed() is used for reproducibility: with the same seed, we guarantee that every run of the script generates the same (pseudo)random sequence.
The strata parameter guarantees that the balance of one of the attributes (here, the one we will use as the outcome) is preserved in both partitions.
dados_classificacao <- dados_finais %>%
mutate(
i_liked_partner = if_else(i_liked_partner == 1, "Liked", "Not") %>% factor(levels = c("Liked","Not"))
)
set.seed(123)
# Put 3/4 of the data into the training set
split_dado <- initial_split(
data = dados_classificacao,
strata = i_liked_partner,
prop = 3/4
)
# Create data frames for the two sets:
dado_treino <- training(split_dado)
dado_teste <- testing(split_dado)
| Name | dado_treino |
| Number of rows | 3664 |
| Number of columns | 64 |
| _______________________ | |
| Column type frequency: | |
| factor | 10 |
| numeric | 54 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| sex | 0 | 1 | FALSE | 2 | Hom: 1845, Mul: 1819 |
| choice | 0 | 1 | FALSE | 2 | ext: 3076, lim: 588 |
| partner_race | 0 | 1 | FALSE | 5 | Whi: 2015, Asi: 919, Lat: 270, Oth: 267 |
| frequency_date | 0 | 1 | TRUE | 7 | Twi: 968, Sev: 920, Onc: 706, Nev: 602 |
| career_macro | 0 | 1 | FALSE | 17 | Aca: 1088, Ban: 932, Cre: 320, Law: 285 |
| opinion_duration_of_date | 0 | 1 | FALSE | 3 | Too: 2116, Jus: 1473, Too: 75 |
| race | 0 | 1 | FALSE | 5 | Whi: 2049, Asi: 895, Lat: 302, Oth: 258 |
| goal | 0 | 1 | FALSE | 6 | Fun: 1528, Mee: 1351, Dat: 281, To : 215 |
| i_liked_partner | 0 | 1 | FALSE | 2 | Not: 2017, Lik: 1647 |
| partner_career_macro | 0 | 1 | FALSE | 17 | Aca: 1055, Ban: 970, Cre: 303, Law: 283 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| match | 0 | 1 | 0.18 | 0.39 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| same_race | 0 | 1 | 0.40 | 0.49 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
| partner_liked_me | 0 | 1 | 0.43 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| met_before | 0 | 1 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| unique_id_number | 0 | 1 | 283.23 | 157.01 | 4.00 | 158.00 | 274.00 | 411.00 | 552.00 | ▇▆▇▆▇ |
| id_within_wave | 0 | 1 | 9.14 | 5.60 | 1.00 | 4.00 | 8.00 | 14.00 | 22.00 | ▇▆▅▃▂ |
| subject_within_gender | 0 | 1 | 17.69 | 11.17 | 1.00 | 8.00 | 16.00 | 27.00 | 44.00 | ▇▇▅▅▂ |
| n_people_met_in_wave | 0 | 1 | 17.02 | 4.35 | 5.00 | 14.00 | 18.00 | 20.00 | 22.00 | ▁▂▂▃▇ |
| position_meeting | 0 | 1 | 9.09 | 5.48 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▅▂ |
| order_meeting | 0 | 1 | 8.85 | 5.45 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| partnet_id_within_wave | 0 | 1 | 9.23 | 5.49 | 1.00 | 5.00 | 9.00 | 14.00 | 22.00 | ▇▆▅▅▂ |
| partner_unique_id_number | 0 | 1 | 283.12 | 156.95 | 4.00 | 158.00 | 274.00 | 411.00 | 552.00 | ▇▆▇▆▇ |
| interests_correlation | 0 | 1 | 0.19 | 0.30 | -0.83 | -0.02 | 0.21 | 0.43 | 0.91 | ▁▃▇▇▂ |
| my_age | 0 | 1 | 26.14 | 3.45 | 18.00 | 24.00 | 26.00 | 28.00 | 55.00 | ▇▇▁▁▁ |
| partner_age | 0 | 1 | 26.19 | 3.35 | 18.00 | 24.00 | 26.00 | 28.00 | 55.00 | ▇▇▁▁▁ |
| partner_stated_pref_time0_attractive | 0 | 1 | 0.22 | 0.11 | 0.00 | 0.15 | 0.20 | 0.25 | 1.00 | ▇▃▁▁▁ |
| partner_stated_pref_time0_sincere | 0 | 1 | 0.18 | 0.07 | 0.00 | 0.15 | 0.18 | 0.20 | 0.47 | ▁▇▇▁▁ |
| partner_stated_pref_time0_intelligent | 0 | 1 | 0.21 | 0.07 | 0.00 | 0.18 | 0.20 | 0.25 | 0.50 | ▁▇▃▁▁ |
| partner_stated_pref_time0_fun | 0 | 1 | 0.17 | 0.06 | 0.00 | 0.15 | 0.18 | 0.20 | 0.50 | ▂▇▂▁▁ |
| partner_stated_pref_time0_ambitious | 0 | 1 | 0.11 | 0.06 | 0.00 | 0.05 | 0.10 | 0.15 | 0.36 | ▆▇▇▁▁ |
| partner_stated_pref_time0_shared_interests | 0 | 1 | 0.12 | 0.06 | 0.00 | 0.10 | 0.11 | 0.16 | 0.30 | ▅▇▇▃▁ |
| importance_same_race | 0 | 1 | 3.87 | 2.85 | 1.00 | 1.00 | 3.00 | 6.00 | 10.00 | ▇▃▃▂▂ |
| importance_same_religion | 0 | 1 | 3.60 | 2.83 | 1.00 | 1.00 | 3.00 | 6.00 | 10.00 | ▇▂▃▂▁ |
| happy_expec | 0 | 1 | 5.52 | 1.77 | 1.00 | 5.00 | 6.00 | 7.00 | 10.00 | ▁▃▇▅▁ |
| i_found_partner__attractive | 0 | 1 | 6.23 | 1.93 | 0.00 | 5.00 | 6.00 | 8.00 | 10.00 | ▁▃▇▇▂ |
| i_found_partner__sincere | 0 | 1 | 7.19 | 1.73 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| i_found_partner__intelligent | 0 | 1 | 7.42 | 1.51 | 0.00 | 7.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| i_found_partner__fun | 0 | 1 | 6.46 | 1.94 | 0.00 | 5.00 | 7.00 | 8.00 | 10.00 | ▁▂▇▇▃ |
| i_found_partner__ambitious | 0 | 1 | 6.82 | 1.79 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▆▇▃ |
| i_found_partner__interests | 0 | 1 | 5.54 | 2.14 | 0.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▂▅▇▆▂ |
| partner_found_me__attractive | 0 | 1 | 6.20 | 1.95 | 0.00 | 5.00 | 6.00 | 8.00 | 10.00 | ▁▃▇▇▂ |
| partner_found_me__sincere | 0 | 1 | 7.18 | 1.73 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| partner_found_me__intelligent | 0 | 1 | 7.39 | 1.54 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| partner_found_me__fun | 0 | 1 | 6.42 | 1.93 | 0.00 | 5.00 | 7.00 | 8.00 | 11.00 | ▁▂▇▇▂ |
| partner_found_me__ambitious | 0 | 1 | 6.75 | 1.80 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▆▇▃ |
| partner_found_me__interests | 0 | 1 | 5.51 | 2.14 | 0.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▂▅▇▆▂ |
| probability_i_find_partner_liked_me | 0 | 1 | 5.33 | 2.12 | 0.00 | 4.00 | 5.00 | 7.00 | 10.00 | ▂▅▇▅▁ |
| you_look_for__attractive | 0 | 1 | 0.22 | 0.11 | 0.00 | 0.15 | 0.20 | 0.25 | 0.90 | ▇▇▁▁▁ |
| you_look_for__sincere | 0 | 1 | 0.18 | 0.07 | 0.00 | 0.15 | 0.18 | 0.20 | 0.47 | ▁▇▇▁▁ |
| you_look_for__intelligent | 0 | 1 | 0.20 | 0.07 | 0.00 | 0.18 | 0.20 | 0.24 | 0.50 | ▁▇▃▁▁ |
| you_look_for__fun | 0 | 1 | 0.17 | 0.06 | 0.00 | 0.15 | 0.18 | 0.20 | 0.50 | ▂▇▂▁▁ |
| you_look_for__ambitious | 0 | 1 | 0.11 | 0.06 | 0.00 | 0.05 | 0.10 | 0.15 | 0.36 | ▆▇▇▁▁ |
| you_look_for__shared_interests | 0 | 1 | 0.12 | 0.06 | 0.00 | 0.10 | 0.11 | 0.16 | 0.30 | ▅▇▇▃▁ |
| you_perceive_yourself__attractive | 0 | 1 | 7.13 | 1.42 | 2.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▂▇▂ |
| you_perceive_yourself__sincere | 0 | 1 | 8.33 | 1.43 | 2.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▁▆▇ |
| you_perceive_yourself__fun | 0 | 1 | 7.75 | 1.58 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▇▆ |
| you_perceive_yourself__intelligent | 0 | 1 | 8.50 | 1.07 | 3.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▂▅▇ |
| you_perceive_yourself__ambitious | 0 | 1 | 7.64 | 1.84 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▂▂▇▇ |
| probability_partner_find_i_liked_partner | 0 | 1 | 5.27 | 2.14 | 0.00 | 4.00 | 5.00 | 7.00 | 10.00 | ▂▅▇▅▁ |
| partner_perceives_himself__attractive | 0 | 1 | 7.10 | 1.40 | 2.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▂▇▂ |
| partner_perceives_himself__sincere | 0 | 1 | 8.35 | 1.42 | 2.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▁▆▇ |
| partner_perceives_himself__fun | 0 | 1 | 7.72 | 1.57 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▇▆ |
| partner_perceives_himself__intelligent | 0 | 1 | 8.46 | 1.09 | 3.00 | 8.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▆▇ |
| partner_perceives_himself__ambitious | 0 | 1 | 7.65 | 1.80 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▂▂▇▆ |
| Name | dado_teste |
| Number of rows | 1221 |
| Number of columns | 64 |
| _______________________ | |
| Column type frequency: | |
| factor | 10 |
| numeric | 54 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| sex | 0 | 1 | FALSE | 2 | Hom: 611, Mul: 610 |
| choice | 0 | 1 | FALSE | 2 | ext: 1023, lim: 198 |
| partner_race | 0 | 1 | FALSE | 5 | Whi: 668, Asi: 291, Lat: 116, Oth: 90 |
| frequency_date | 0 | 1 | TRUE | 7 | Twi: 333, Sev: 322, Onc: 204, Nev: 191 |
| career_macro | 0 | 1 | FALSE | 16 | Aca: 384, Ban: 304, Law: 116, Cre: 99 |
| opinion_duration_of_date | 0 | 1 | FALSE | 3 | Too: 670, Jus: 523, Too: 28 |
| race | 0 | 1 | FALSE | 5 | Whi: 638, Asi: 295, Oth: 114, Lat: 104 |
| goal | 0 | 1 | FALSE | 6 | Fun: 526, Mee: 442, Dat: 91, Oth: 58 |
| i_liked_partner | 0 | 1 | FALSE | 2 | Not: 672, Lik: 549 |
| partner_career_macro | 0 | 1 | FALSE | 17 | Aca: 341, Ban: 322, Cre: 105, Law: 96 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| match | 0 | 1 | 0.17 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| same_race | 0 | 1 | 0.37 | 0.48 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
| partner_liked_me | 0 | 1 | 0.44 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| met_before | 0 | 1 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| unique_id_number | 0 | 1 | 282.60 | 156.57 | 4.00 | 160.00 | 274.00 | 408.00 | 552.00 | ▆▆▇▆▇ |
| id_within_wave | 0 | 1 | 8.98 | 5.46 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| subject_within_gender | 0 | 1 | 17.37 | 10.87 | 1.00 | 8.00 | 16.00 | 26.00 | 44.00 | ▇▇▆▅▂ |
| n_people_met_in_wave | 0 | 1 | 16.97 | 4.27 | 5.00 | 15.00 | 18.00 | 20.00 | 22.00 | ▁▃▂▅▇ |
| position_meeting | 0 | 1 | 9.23 | 5.53 | 1.00 | 4.00 | 9.00 | 14.00 | 22.00 | ▇▆▅▅▂ |
| order_meeting | 0 | 1 | 8.95 | 5.47 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▅▂ |
| partnet_id_within_wave | 0 | 1 | 8.81 | 5.57 | 1.00 | 4.00 | 8.00 | 13.00 | 22.00 | ▇▆▅▃▂ |
| partner_unique_id_number | 0 | 1 | 282.29 | 157.14 | 4.00 | 156.00 | 274.00 | 411.00 | 552.00 | ▇▆▇▆▇ |
| interests_correlation | 0 | 1 | 0.21 | 0.31 | -0.83 | -0.01 | 0.23 | 0.44 | 0.91 | ▁▃▇▇▂ |
| my_age | 0 | 1 | 26.19 | 3.39 | 18.00 | 24.00 | 26.00 | 28.00 | 42.00 | ▂▇▅▁▁ |
| partner_age | 0 | 1 | 26.18 | 3.58 | 18.00 | 23.00 | 26.00 | 28.00 | 55.00 | ▇▇▁▁▁ |
| partner_stated_pref_time0_attractive | 0 | 1 | 0.23 | 0.12 | 0.00 | 0.15 | 0.20 | 0.25 | 0.90 | ▆▇▂▁▁ |
| partner_stated_pref_time0_sincere | 0 | 1 | 0.17 | 0.07 | 0.00 | 0.15 | 0.18 | 0.20 | 0.47 | ▂▇▇▁▁ |
| partner_stated_pref_time0_intelligent | 0 | 1 | 0.20 | 0.07 | 0.00 | 0.17 | 0.20 | 0.23 | 0.50 | ▂▇▃▁▁ |
| partner_stated_pref_time0_fun | 0 | 1 | 0.18 | 0.06 | 0.00 | 0.15 | 0.18 | 0.20 | 0.50 | ▂▇▂▁▁ |
| partner_stated_pref_time0_ambitious | 0 | 1 | 0.11 | 0.06 | 0.00 | 0.05 | 0.10 | 0.15 | 0.36 | ▇▇▇▁▁ |
| partner_stated_pref_time0_shared_interests | 0 | 1 | 0.12 | 0.06 | 0.00 | 0.08 | 0.11 | 0.16 | 0.30 | ▆▇▇▅▁ |
| importance_same_race | 0 | 1 | 3.68 | 2.79 | 1.00 | 1.00 | 3.00 | 6.00 | 10.00 | ▇▃▂▂▁ |
| importance_same_religion | 0 | 1 | 3.66 | 2.90 | 1.00 | 1.00 | 3.00 | 6.00 | 10.00 | ▇▃▂▂▂ |
| happy_expec | 0 | 1 | 5.43 | 1.82 | 1.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▁▃▇▅▁ |
| i_found_partner__attractive | 0 | 1 | 6.29 | 1.97 | 0.00 | 5.00 | 6.00 | 8.00 | 10.00 | ▁▃▇▇▃ |
| i_found_partner__sincere | 0 | 1 | 7.29 | 1.71 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| i_found_partner__intelligent | 0 | 1 | 7.46 | 1.55 | 0.00 | 7.00 | 8.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| i_found_partner__fun | 0 | 1 | 6.54 | 1.94 | 0.00 | 5.00 | 7.00 | 8.00 | 10.00 | ▁▂▆▇▃ |
| i_found_partner__ambitious | 0 | 1 | 6.83 | 1.79 | 0.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▆▇▃ |
| i_found_partner__interests | 0 | 1 | 5.57 | 2.15 | 0.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▂▅▇▆▂ |
| partner_found_me__attractive | 0 | 1 | 6.23 | 1.89 | 1.00 | 5.00 | 6.00 | 8.00 | 10.00 | ▁▃▇▇▂ |
| partner_found_me__sincere | 0 | 1 | 7.13 | 1.74 | 1.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| partner_found_me__intelligent | 0 | 1 | 7.36 | 1.50 | 1.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▃▇▃ |
| partner_found_me__fun | 0 | 1 | 6.45 | 1.97 | 0.00 | 5.00 | 7.00 | 8.00 | 10.00 | ▁▂▇▇▃ |
| partner_found_me__ambitious | 0 | 1 | 6.80 | 1.76 | 1.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▂▆▇▃ |
| partner_found_me__interests | 0 | 1 | 5.48 | 2.11 | 0.00 | 4.00 | 6.00 | 7.00 | 10.00 | ▂▅▇▅▂ |
| probability_i_find_partner_liked_me | 0 | 1 | 5.36 | 2.12 | 0.00 | 4.00 | 5.00 | 7.00 | 10.00 | ▂▃▇▅▁ |
| you_look_for__attractive | 0 | 1 | 0.22 | 0.11 | 0.00 | 0.15 | 0.20 | 0.25 | 0.90 | ▇▇▁▁▁ |
| you_look_for__sincere | 0 | 1 | 0.18 | 0.07 | 0.00 | 0.15 | 0.19 | 0.20 | 0.47 | ▁▇▇▁▁ |
| you_look_for__intelligent | 0 | 1 | 0.20 | 0.06 | 0.00 | 0.18 | 0.20 | 0.23 | 0.50 | ▁▇▃▁▁ |
| you_look_for__fun | 0 | 1 | 0.17 | 0.06 | 0.00 | 0.15 | 0.18 | 0.20 | 0.50 | ▂▇▂▁▁ |
| you_look_for__ambitious | 0 | 1 | 0.11 | 0.06 | 0.00 | 0.05 | 0.10 | 0.15 | 0.36 | ▆▇▇▁▁ |
| you_look_for__shared_interests | 0 | 1 | 0.12 | 0.06 | 0.00 | 0.08 | 0.11 | 0.16 | 0.30 | ▆▇▆▅▁ |
| you_perceive_yourself__attractive | 0 | 1 | 7.18 | 1.38 | 2.00 | 7.00 | 7.00 | 8.00 | 10.00 | ▁▂▂▇▂ |
| you_perceive_yourself__sincere | 0 | 1 | 8.33 | 1.45 | 2.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▁▆▇ |
| you_perceive_yourself__fun | 0 | 1 | 7.83 | 1.58 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▇▇ |
| you_perceive_yourself__intelligent | 0 | 1 | 8.49 | 1.08 | 3.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▂▆▇ |
| you_perceive_yourself__ambitious | 0 | 1 | 7.68 | 1.80 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▂▂▇▇ |
| probability_partner_find_i_liked_partner | 0 | 1 | 5.27 | 2.05 | 0.00 | 4.00 | 5.00 | 7.00 | 10.00 | ▂▃▇▅▁ |
| partner_perceives_himself__attractive | 0 | 1 | 7.15 | 1.39 | 2.00 | 6.00 | 7.00 | 8.00 | 10.00 | ▁▁▂▇▂ |
| partner_perceives_himself__sincere | 0 | 1 | 8.27 | 1.49 | 2.00 | 7.00 | 9.00 | 9.00 | 10.00 | ▁▁▁▆▇ |
| partner_perceives_himself__fun | 0 | 1 | 7.75 | 1.57 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▁▂▇▆ |
| partner_perceives_himself__intelligent | 0 | 1 | 8.46 | 1.06 | 4.00 | 8.00 | 9.00 | 9.00 | 10.00 | ▁▁▂▅▇ |
| partner_perceives_himself__ambitious | 0 | 1 | 7.59 | 1.89 | 2.00 | 7.00 | 8.00 | 9.00 | 10.00 | ▁▂▂▇▇ |
As we mentioned, the tidymodels libraries are orthogonal.
The recipes library serves one specific purpose: defining the pre-processing steps for the data. These steps can be defined independently of the choice of model, the form of cross-validation, the performance measurement, and so on.
One of the steps that can be defined with recipes is flagging attributes that play no predictive role and therefore should not be used in training or prediction, but that we want to keep in our tibble to identify the rows.
receita <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
update_role(
match,
unique_id_number,
id_within_wave,
subject_within_gender,
partnet_id_within_wave,
partner_unique_id_number,
new_role = "ID"
)
summary(receita)
## # A tibble: 64 x 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 sex nominal predictor original
## 2 choice nominal predictor original
## 3 partner_race nominal predictor original
## 4 frequency_date nominal predictor original
## 5 career_macro nominal predictor original
## 6 opinion_duration_of_date nominal predictor original
## 7 race nominal predictor original
## 8 goal nominal predictor original
## 9 match numeric ID original
## 10 same_race numeric predictor original
## # ... with 54 more rows
Some model engines do not work well with factors.
Anyone used to lm knows that factors are automatically turned into dummies, but that does not happen with every engine.
step_dummy() does this job, i.e., it creates a binary variable for each level of the factor (minus one). Note how we can use the all_nominal() and all_outcomes() selectors.
step_zv() removes variables with zero variance. This happens a lot when factor levels are infrequent.
receita <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
update_role(
match,
unique_id_number,
id_within_wave,
subject_within_gender,
partnet_id_within_wave,
partner_unique_id_number,
new_role = "ID"
) %>%
step_dummy(
all_nominal(), -all_outcomes()
) %>%
step_zv(all_predictors())
summary(receita) %>%
  gt()
| variable | type | role | source |
|---|---|---|---|
| sex | nominal | predictor | original |
| choice | nominal | predictor | original |
| partner_race | nominal | predictor | original |
| frequency_date | nominal | predictor | original |
| career_macro | nominal | predictor | original |
| opinion_duration_of_date | nominal | predictor | original |
| race | nominal | predictor | original |
| goal | nominal | predictor | original |
| match | numeric | ID | original |
| same_race | numeric | predictor | original |
| partner_liked_me | numeric | predictor | original |
| met_before | numeric | predictor | original |
| unique_id_number | numeric | ID | original |
| id_within_wave | numeric | ID | original |
| subject_within_gender | numeric | ID | original |
| n_people_met_in_wave | numeric | predictor | original |
| position_meeting | numeric | predictor | original |
| order_meeting | numeric | predictor | original |
| partnet_id_within_wave | numeric | ID | original |
| partner_unique_id_number | numeric | ID | original |
| interests_correlation | numeric | predictor | original |
| my_age | numeric | predictor | original |
| partner_age | numeric | predictor | original |
| partner_stated_pref_time0_attractive | numeric | predictor | original |
| partner_stated_pref_time0_sincere | numeric | predictor | original |
| partner_stated_pref_time0_intelligent | numeric | predictor | original |
| partner_stated_pref_time0_fun | numeric | predictor | original |
| partner_stated_pref_time0_ambitious | numeric | predictor | original |
| partner_stated_pref_time0_shared_interests | numeric | predictor | original |
| importance_same_race | numeric | predictor | original |
| importance_same_religion | numeric | predictor | original |
| happy_expec | numeric | predictor | original |
| i_found_partner__attractive | numeric | predictor | original |
| i_found_partner__sincere | numeric | predictor | original |
| i_found_partner__intelligent | numeric | predictor | original |
| i_found_partner__fun | numeric | predictor | original |
| i_found_partner__ambitious | numeric | predictor | original |
| i_found_partner__interests | numeric | predictor | original |
| partner_found_me__attractive | numeric | predictor | original |
| partner_found_me__sincere | numeric | predictor | original |
| partner_found_me__intelligent | numeric | predictor | original |
| partner_found_me__fun | numeric | predictor | original |
| partner_found_me__ambitious | numeric | predictor | original |
| partner_found_me__interests | numeric | predictor | original |
| probability_i_find_partner_liked_me | numeric | predictor | original |
| you_look_for__attractive | numeric | predictor | original |
| you_look_for__sincere | numeric | predictor | original |
| you_look_for__intelligent | numeric | predictor | original |
| you_look_for__fun | numeric | predictor | original |
| you_look_for__ambitious | numeric | predictor | original |
| you_look_for__shared_interests | numeric | predictor | original |
| you_perceive_yourself__attractive | numeric | predictor | original |
| you_perceive_yourself__sincere | numeric | predictor | original |
| you_perceive_yourself__fun | numeric | predictor | original |
| you_perceive_yourself__intelligent | numeric | predictor | original |
| you_perceive_yourself__ambitious | numeric | predictor | original |
| probability_partner_find_i_liked_partner | numeric | predictor | original |
| partner_career_macro | nominal | predictor | original |
| partner_perceives_himself__attractive | numeric | predictor | original |
| partner_perceives_himself__sincere | numeric | predictor | original |
| partner_perceives_himself__fun | numeric | predictor | original |
| partner_perceives_himself__intelligent | numeric | predictor | original |
| partner_perceives_himself__ambitious | numeric | predictor | original |
| i_liked_partner | nominal | outcome | original |
Some factors are ordinal. It can be a good idea to encode them as a single numeric variable that preserves the natural order of the levels.
In our example, the variable that represents how often people go on dates (frequency_date) has levels that can be ordered.
receita <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
update_role(
match,
unique_id_number,
id_within_wave,
subject_within_gender,
partnet_id_within_wave,
partner_unique_id_number,
new_role = "ID"
) %>%
step_ordinalscore(
frequency_date
) %>%
step_dummy(
all_nominal(), -frequency_date, -all_outcomes()
) %>%
  step_zv(all_predictors())
Knowing how to create a pre-processing recipe, and recalling how to create a generic interface for a model with parsnip and how to select an engine for it, we can assemble a small workflow to run this process, using the workflows library.
lr_mod <-
logistic_reg() %>%
set_engine("glm")
wf <- workflow() %>%
add_recipe(receita) %>%
add_model(lr_mod)
wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_ordinalscore()
## * step_dummy()
## * step_zv()
##
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
With the workflow created, we can fit it using the fit() function.
fit_teste <-
wf %>%
fit(
data = dado_treino
)
fit_teste %>% tidy() %>%
select(
term,
estimate,
p.value
) %>%
arrange(
p.value
) %>%
gt() %>%
fmt_number(
columns = one_of("estimate"),
decimals = 3
) %>%
fmt_number(
columns = one_of("p.value"),
decimals = 2
  )
| term | estimate | p.value |
|---|---|---|
| i_found_partner__attractive | −0.573 | 0.00 |
| probability_i_find_partner_liked_me | −0.308 | 0.00 |
| i_found_partner__interests | −0.222 | 0.00 |
| i_found_partner__fun | −0.279 | 0.00 |
| you_look_for__attractive | 5.085 | 0.00 |
| you_perceive_yourself__fun | 0.171 | 0.00 |
| race_White | 0.620 | 0.00 |
| sex_Homem | −0.536 | 0.00 |
| partner_found_me__attractive | 0.120 | 0.00 |
| career_macro_Politics | 2.790 | 0.00 |
| you_perceive_yourself__sincere | 0.133 | 0.00 |
| same_race | −0.376 | 0.00 |
| importance_same_race | 0.063 | 0.00 |
| i_found_partner__ambitious | 0.123 | 0.00 |
| goal_Serious | −0.984 | 0.00 |
| career_macro_Pro.sports.Athletics | −4.469 | 0.00 |
| probability_partner_find_i_liked_partner | −0.083 | 0.00 |
| you_look_for__ambitious | 4.241 | 0.00 |
| partner_career_macro_Speech.Pathology | 2.574 | 0.00 |
| n_people_met_in_wave | −0.054 | 0.00 |
| i_found_partner__sincere | 0.110 | 0.01 |
| career_macro_Lawyer | 0.515 | 0.01 |
| partner_career_macro_Psychologist | 0.729 | 0.01 |
| partner_career_macro_Creative.Arts.Entertainment | 0.443 | 0.01 |
| career_macro_Other | 1.721 | 0.03 |
| goal_Other | 0.623 | 0.03 |
| frequency_date | 0.077 | 0.03 |
| importance_same_religion | 0.040 | 0.03 |
| partner_liked_me | 0.229 | 0.04 |
| partner_found_me__interests | 0.060 | 0.04 |
| partner_career_macro_Journalism | −1.347 | 0.04 |
| career_macro_Social.Work | −0.596 | 0.04 |
| my_age | 0.029 | 0.05 |
| partner_career_macro_Real.Estate | 1.255 | 0.05 |
| choice_limited | −0.429 | 0.05 |
| happy_expec | −0.057 | 0.06 |
| opinion_duration_of_date_Too.little | 0.188 | 0.06 |
| race_Latino | 0.352 | 0.07 |
| you_perceive_yourself__attractive | 0.078 | 0.08 |
| you_look_for__sincere | 1.770 | 0.08 |
| you_look_for__intelligent | 1.716 | 0.09 |
| partner_perceives_himself__intelligent | 0.089 | 0.09 |
| opinion_duration_of_date_Too.much | −0.538 | 0.09 |
| career_macro_Creative.Arts.Entertainment | 0.312 | 0.09 |
| partner_perceives_himself__fun | −0.059 | 0.11 |
| partner_career_macro_Lawyer | −0.311 | 0.12 |
| partner_race_Others | −0.314 | 0.12 |
| partner_career_macro_Pro.sports.Athletics | −2.273 | 0.12 |
| (Intercept) | 2.235 | 0.13 |
| partner_found_me__intelligent | −0.061 | 0.17 |
| goal_Meet.new.people | −0.244 | 0.18 |
| career_macro_Engineer | 0.390 | 0.19 |
| career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin | −0.168 | 0.20 |
| partner_stated_pref_time0_intelligent | −1.271 | 0.20 |
| partner_perceives_himself__ambitious | 0.040 | 0.22 |
| partner_career_macro_Doctor.Medicine | −0.248 | 0.22 |
| career_macro_Doctor.Medicine | 0.231 | 0.24 |
| partner_career_macro_Social.Work | 0.354 | 0.26 |
| partner_perceives_himself__sincere | −0.041 | 0.26 |
| position_meeting | −0.010 | 0.26 |
| you_look_for__fun | −1.132 | 0.27 |
| race_Black | −0.278 | 0.27 |
| partner_found_me__ambitious | −0.037 | 0.27 |
| career_macro_Journalism | −0.690 | 0.27 |
| you_perceive_yourself__ambitious | −0.033 | 0.30 |
| interests_correlation | 0.150 | 0.33 |
| partner_career_macro_Engineer | 0.265 | 0.38 |
| i_found_partner__intelligent | −0.043 | 0.38 |
| partner_perceives_himself__attractive | 0.036 | 0.40 |
| career_macro_Psychologist | −0.223 | 0.40 |
| career_macro_Speech.Pathology | 0.986 | 0.42 |
| partner_career_macro_Architecture | −0.810 | 0.42 |
| career_macro_Undecided | −0.187 | 0.43 |
| partner_found_me__sincere | −0.027 | 0.46 |
| partner_stated_pref_time0_sincere | 0.732 | 0.48 |
| career_macro_Architecture | −0.753 | 0.50 |
| partner_career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin | −0.078 | 0.55 |
| race_Others | −0.121 | 0.56 |
| partner_race_Black | −0.124 | 0.57 |
| career_macro_Real.Estate | −0.987 | 0.60 |
| partner_career_macro_Other | −0.298 | 0.60 |
| career_macro_International.Humanitarian.Affairs | 0.094 | 0.65 |
| partner_stated_pref_time0_ambitious | 0.479 | 0.71 |
| partner_career_macro_International.Humanitarian.Affairs | 0.063 | 0.74 |
| goal_To.say | −0.084 | 0.75 |
| goal_Fun | 0.054 | 0.76 |
| partner_stated_pref_time0_fun | 0.274 | 0.79 |
| partner_stated_pref_time0_attractive | −0.134 | 0.87 |
| partner_career_macro_Undecided | 0.035 | 0.88 |
| met_before | 0.027 | 0.90 |
| you_perceive_yourself__intelligent | −0.006 | 0.91 |
| partner_found_me__fun | −0.004 | 0.91 |
| partner_race_White | −0.014 | 0.91 |
| partner_race_Latino | −0.015 | 0.94 |
| order_meeting | 0.000 | 0.97 |
| partner_age | 0.001 | 0.97 |
| partner_career_macro_Politics | −0.022 | 0.97 |
| partner_stated_pref_time0_shared_interests | NA | NA |
| you_look_for__shared_interests | NA | NA |
An interesting kind of step that can be added to the recipe is creating new predictors that represent interactions between the original predictors.
The helper functions are very handy here: they make it much easier to add several interaction terms at once.
Here we will use regression instead of classification, to make the coefficients easier to interpret.
dado_treino_regressao <- dado_treino %>%
mutate(
i_liked_partner = if_else(i_liked_partner == "Liked", 1, 0)
)
receita_com_interacao <- recipe(i_liked_partner ~ ., data = dado_treino_regressao) %>%
update_role(
match,
unique_id_number,
id_within_wave,
subject_within_gender,
partnet_id_within_wave,
partner_unique_id_number,
new_role = "ID"
) %>%
step_ordinalscore(
frequency_date
) %>%
step_dummy(
all_nominal(), -frequency_date, -all_outcomes()
) %>%
step_interact(
terms = ~ starts_with("race")*starts_with("partner_race")*importance_same_race
) %>%
step_interact(
terms = ~ starts_with("i_found_partner__")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("interests_correlation")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("partner_age")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("you_perceive_yourself__")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("partner_perceives_himself__")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("career_macro")*starts_with("partner_career_macro")
) %>%
step_zv(all_predictors())
Some models work better with normalized predictors, for example those that apply penalties to the coefficients multiplying the predictors. With the predictors on a similar scale, the penalization is fairer.
receita_com_interacao <- recipe(i_liked_partner ~ ., data = dado_treino_regressao) %>%
update_role(
match,
unique_id_number,
id_within_wave,
subject_within_gender,
partnet_id_within_wave,
partner_unique_id_number,
new_role = "ID"
) %>%
step_ordinalscore(
frequency_date
) %>%
step_dummy(
all_nominal(), -frequency_date, -all_outcomes()
) %>%
step_interact(
terms = ~ starts_with("race")*starts_with("partner_race")*importance_same_race
) %>%
step_interact(
terms = ~ starts_with("i_found_partner__")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("interests_correlation")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("partner_age")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("you_perceive_yourself__")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("partner_perceives_himself__")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("career_macro")*starts_with("partner_career_macro")
) %>%
step_zv(all_predictors()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric())
With the addition of these interaction terms we end up with many predictors, several of them highly correlated.
Using highly correlated terms (and not only in pairs) can lead to collinearity. The model may attribute an effect to one input variable or another depending on the sample used for training, and therefore ends up with higher variance.
One way to keep the model from effectively using too many variables is to apply a penalty that reduces the number of active coefficients and, consequently, the model's variance. This is what Ridge and Lasso regressions do.
These regressions, which penalize the number and the size of the coefficients relating the input variables to the response, trade variance for bias and (in the Lasso case) gain interpretability: they have more bias and adapt less to the training set, but they do not vary as much with the particular sample of the population chosen for training. In addition, with fewer coefficients "switched on" (in the Lasso case), the model is easier to interpret.
The Elastic Net model combines the Ridge penalty with the Lasso penalty by modifying the regression's loss function, which in ordinary least squares, the most common form, is, as the name says, the squared error:
\[RSS = \sum_{i = 1}^{n} ( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij})^2 \]
In Ridge regression, the coefficients receive a quadratic penalty. This reduces the model's variance, but it does not reduce the number of nonzero coefficients as much:
\[Loss_{Ridge} = RSS + \lambda \sum_{j=1}^{p}\beta_j^2 \]
In Lasso regression, the coefficients are penalized by their absolute value. This reduces the model's variance AND reduces the number of nonzero coefficients, favoring interpretability:
\[Loss_{Lasso} = RSS + \lambda \sum_{j=1}^{p} \left| \beta_j \right| \]
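Combining the two penalties, in the parameterization documented for the glmnet engine (parsnip's mixture argument plays the role of \(\alpha\): 1 gives a pure Lasso, 0 a pure Ridge):
\[Loss_{ElasticNet} = RSS + \lambda \left( \alpha \sum_{j=1}^{p} \left| \beta_j \right| + \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^2 \right) \]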
The concept of an interface is important in software engineering.
It is always better to depend on interfaces than on implementations. parsnip works as an abstraction layer, offering a single interface that takes care of calling the different implementations.
parsnip's generic interfaces accept the most common parameters used by the engines and know how to pass those parameter values along to each engine.
In the case of linear regressions, some engines such as glmnet accept the hyperparameters needed to implement the lasso-ridge Elastic Net regression.
lr_mod <-
linear_reg(penalty = .02, mixture = 1) %>%
set_engine("glmnet")
wf_com_interacao <- workflow() %>%
add_recipe(receita_com_interacao) %>%
add_model(lr_mod)
wf_com_interacao
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: linear_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 12 Recipe Steps
##
## * step_ordinalscore()
## * step_dummy()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_zv()
## * ...
## * and 2 more steps.
##
## -- Model -----------------------------------------------------------------------
## Linear Regression Model Specification (regression)
##
## Main Arguments:
## penalty = 0.02
## mixture = 1
##
## Computational engine: glmnet
With the Elastic Net estimation, fewer predictors end up switched on:
fit_com_interacao <-
wf_com_interacao %>%
fit(
data = dado_treino_regressao
)
fit_com_interacao %>% tidy() %>%
filter(
estimate != 0
) %>%
arrange(
estimate %>% abs() %>% desc()
) %>%
select(
term,
estimate
) %>%
gt() %>%
fmt_number(
columns = one_of("estimate"),
decimals = 3
)
| term | estimate |
|---|---|
| i_found_partner__attractive | 0.290 |
| probability_i_find_partner_liked_me | 0.147 |
| i_found_partner__interests | 0.113 |
| i_found_partner__fun | 0.100 |
| i_found_partner__attractive_x_sex_Homem | 0.081 |
| you_look_for__attractive | −0.055 |
| partner_found_me__attractive | −0.055 |
| race_White | −0.040 |
| you_perceive_yourself__attractive | −0.033 |
| probability_partner_find_i_liked_partner | 0.033 |
| you_perceive_yourself__fun | −0.029 |
| you_perceive_yourself__sincere | −0.029 |
| career_macro_Politics | −0.028 |
| importance_same_race | −0.021 |
| goal_Other | −0.021 |
| you_look_for__shared_interests | 0.021 |
| partner_career_macro_Speech.Pathology | −0.021 |
| goal_Serious | 0.021 |
| i_found_partner__sincere | −0.021 |
| career_macro_Creative.Arts.Entertainment_x_partner_career_macro_Creative.Arts.Entertainment | −0.020 |
| career_macro_Pro.sports.Athletics | 0.020 |
| happy_expec | 0.019 |
| i_found_partner__ambitious | −0.017 |
| partner_liked_me | −0.016 |
| career_macro_Psychologist_x_partner_career_macro_Architecture | 0.015 |
| importance_same_religion | −0.015 |
| career_macro_Lawyer_x_partner_career_macro_Creative.Arts.Entertainment | −0.014 |
| race_Latino_x_importance_same_race | −0.014 |
| you_look_for__fun | 0.012 |
| career_macro_International.Humanitarian.Affairs_x_partner_career_macro_Creative.Arts.Entertainment | −0.011 |
| same_race | 0.011 |
| career_macro_Social.Work | 0.011 |
| career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin_x_partner_career_macro_Engineer | −0.010 |
| goal_Meet.new.people | 0.009 |
| opinion_duration_of_date_Too.much | 0.009 |
| career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin | 0.008 |
| partner_career_macro_Lawyer | 0.008 |
| career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin_x_partner_career_macro_Speech.Pathology | 0.008 |
| race_White_x_importance_same_race | −0.008 |
| partner_career_macro_Real.Estate | −0.007 |
| frequency_date | −0.006 |
| career_macro_Other_x_partner_career_macro_Lawyer | −0.006 |
| career_macro_Doctor.Medicine_x_partner_career_macro_International.Humanitarian.Affairs | −0.006 |
| partner_found_me__intelligent | 0.006 |
| career_macro_Other | −0.006 |
| career_macro_Engineer_x_partner_career_macro_Social.Work | −0.005 |
| career_macro_Undecided_x_partner_career_macro_Doctor.Medicine | 0.005 |
| partner_career_macro_Journalism | 0.005 |
| partner_race_Latino_x_importance_same_race | −0.004 |
| career_macro_Lawyer_x_partner_career_macro_Psychologist | −0.004 |
| partner_career_macro_Psychologist | −0.003 |
| career_macro_Creative.Arts.Entertainment | −0.003 |
| partner_found_me__sincere | 0.003 |
| career_macro_Lawyer_x_partner_career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin | −0.003 |
| career_macro_Creative.Arts.Entertainment_x_partner_career_macro_Speech.Pathology | −0.002 |
| career_macro_Engineer_x_partner_career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin | 0.002 |
| n_people_met_in_wave | 0.002 |
| you_look_for__ambitious | −0.002 |
| career_macro_Psychologist_x_partner_career_macro_Psychologist | 0.002 |
| career_macro_International.Humanitarian.Affairs_x_partner_career_macro_Engineer | −0.001 |
| career_macro_Social.Work_x_partner_career_macro_Other | −0.001 |
| career_macro_International.Humanitarian.Affairs_x_partner_career_macro_International.Humanitarian.Affairs | −0.001 |
| race_Others_x_partner_race_Latino | 0.001 |
| race_Latino_x_partner_race_Latino_x_importance_same_race | −0.000 |
| partner_stated_pref_time0_attractive | 0.000 |
| career_macro_Lawyer_x_partner_career_macro_Other | 0.000 |
| position_meeting | 0.000 |
| career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin_x_partner_career_macro_Banking.Consulting.Finance.Marketing.Business.CEO.Entrepreneur.Admin | 0.000 |
| career_macro_Journalism | 0.000 |
| partner_career_macro_Social.Work | −0.000 |
| (Intercept) | −0.000 |
Here we add a few more preprocessing steps:
step_mutate() creates a new predictor from a calculation on top of others;
step_poly() adds predictors raised to natural powers greater than 1 (2 by default), making it possible to represent nonlinear effects (quadratic ones, for example);
step_corr() removes predictors so that no pair remains with a correlation above the threshold.
receita_com_interacao_class <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
update_role(
match,
unique_id_number,
id_within_wave,
subject_within_gender,
partnet_id_within_wave,
partner_unique_id_number,
new_role = "ID"
) %>%
step_ordinalscore(
frequency_date
) %>%
step_dummy(
all_nominal(), -frequency_date, -all_outcomes()
) %>%
step_mutate(
diff_age = my_age - partner_age
) %>%
step_poly(
diff_age
) %>%
step_interact(
terms = ~ starts_with("diff_age") * starts_with("sex")
) %>%
step_interact(
terms = ~ starts_with("race")*starts_with("partner_race")*importance_same_race
) %>%
step_interact(
terms = ~ starts_with("i_found_partner__")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("interests_correlation")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("partner_age")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("you_perceive_yourself__")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("partner_perceives_himself__")*starts_with("sex_")
) %>%
step_interact(
terms = ~ starts_with("partner_found_me__")*starts_with("sex")
) %>%
step_interact(
terms = ~ starts_with("you_look_for__")*starts_with("sex")
) %>%
step_interact(
terms = ~ starts_with("you_look_for__")*starts_with("i_found_partner")
) %>%
step_interact(
terms = ~ starts_with("probability_partner_find_i_liked_partner")*starts_with("sex")
) %>%
step_interact(
terms = ~ starts_with("career_")*starts_with("sex")
) %>%
step_interact(
terms = ~ starts_with("partner_career_")*starts_with("sex")
) %>%
step_interact(
terms = ~ starts_with("goal")*starts_with("sex")
) %>%
step_corr(
all_predictors(),
threshold = 0.8
) %>%
step_zv(all_predictors()) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric())
log_mod <-
logistic_reg() %>%
set_engine("glm")
wf_com_interacao_class <- workflow() %>%
add_recipe(receita_com_interacao_class) %>%
add_model(log_mod)
fit_com_interacao_class <-
wf_com_interacao_class %>%
fit(
data = dado_treino
)
The ROC (Receiver Operating Characteristic) curve was invented around World War II to assess whether American radar operators were reliably detecting Japanese aircraft from radar signals.
The curve shows, for various thresholds, the true positive rate (or sensitivity) and the false positive rate (fall-out, or \(1 - specificity\)).
A numeric metric that summarizes the overall accuracy of a classification model is the area under this curve (AUC, Area Under the Curve). Note that the closer this area is to one, the lower the false positive rate and the higher the sensitivity.
Here we run the prediction on the test data and use yardstick's roc_curve() function to generate the data needed to plot the curve:
pred_like <- predict(
object = fit_com_interacao_class,
new_data = dado_teste,
type = "prob"
) %>%
bind_cols(dado_teste %>% select(i_liked_partner))
dados_roc <- pred_like %>%
roc_curve(
truth = i_liked_partner,
.pred_Liked
)
dados_roc %>%
filter(
row_number() %% 100 == 0
) %>%
gt() %>%
fmt_number(
columns = everything(),
decimals = 3
)
| .threshold | specificity | sensitivity |
|---|---|---|
| 0.015 | 0.137 | 0.989 |
| 0.048 | 0.277 | 0.978 |
| 0.109 | 0.409 | 0.958 |
| 0.188 | 0.543 | 0.940 |
| 0.297 | 0.656 | 0.896 |
| 0.446 | 0.751 | 0.831 |
| 0.542 | 0.833 | 0.749 |
| 0.669 | 0.896 | 0.643 |
| 0.787 | 0.936 | 0.510 |
| 0.871 | 0.967 | 0.366 |
| 0.939 | 0.988 | 0.209 |
| 0.991 | 0.999 | 0.040 |
We then plot the curve:
ponto_gatilho <- dados_roc %>%
filter(
.threshold > 0.5
) %>%
slice_min(
n = 1, order_by = .threshold
)
dados_roc %>% ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_path() +
geom_point(
data = ponto_gatilho,
aes(x = 1 - specificity, y = sensitivity),
size = 3,
color = "darkblue"
) +
geom_text_repel(
data = ponto_gatilho,
aes(
x = 1 - specificity + 0.15,
y = sensitivity - 0.15,
label = str_glue("{sensitivity %>% percent(accuracy = 0.1)}/{(1 - specificity) %>% percent(accuracy = 0.1)}")
)
) +
geom_abline(lty = 3) +
coord_equal() +
theme_bw()
There is a function, autoplot(), that plots the curve automatically.
The roc_auc() function computes the area under the curve:
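As a minimal sketch (reusing the pred_like predictions computed above), the two calls would be:
pred_like %>%
roc_curve(
truth = i_liked_partner,
.pred_Liked
) %>%
autoplot()
pred_like %>%
roc_auc(
truth = i_liked_partner,
.pred_Liked
)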
| .metric | .estimator | .estimate |
|---|---|---|
| roc_auc | binary | 0.8686356 |
receita_sem_interacao_class <- recipe(i_liked_partner ~ ., data = dado_treino) %>%
update_role(
match,
unique_id_number,
id_within_wave,
subject_within_gender,
partnet_id_within_wave,
partner_unique_id_number,
new_role = "ID"
) %>%
step_ordinalscore(
frequency_date
) %>%
step_dummy(
all_nominal(), -frequency_date, -all_outcomes()
)
wf_sem_interacao_class <- workflow() %>%
add_recipe(receita_sem_interacao_class) %>%
add_model(log_mod)
fit_sem_interacao_class <-
wf_sem_interacao_class %>%
fit(
data = dado_treino
)
pred_like_sem <- predict(
object = fit_sem_interacao_class,
new_data = dado_teste,
type = "prob"
) %>%
bind_cols(dado_teste %>% select(i_liked_partner))
pred_like_sem %>%
roc_curve(
truth = i_liked_partner,
.pred_Liked
) %>%
autoplot()
| .metric | .estimator | .estimate |
|---|---|---|
| roc_auc | binary | 0.8669198 |
The curves came out very similar.
Adding the interactions did not help much, but that may be due to the simplicity of the model.
rocs <- bind_rows(
roc_curve(pred_like_sem, i_liked_partner, .pred_Liked) %>% mutate(tipo = "Sem interação"),
roc_curve(pred_like, i_liked_partner, .pred_Liked) %>% mutate(tipo = "Com interação"),
)
rocs %>%
ggplot(aes(x = 1 - specificity, y = sensitivity, color = tipo)) +
geom_path() +
geom_abline(lty = 3) +
coord_equal() +
theme_bw() +
theme(
legend.position = "top"
)
The rsample package, which we have already seen, provides the infrastructure to generate the samples needed for cross-validation.
Recall that we need to set up the following scheme:
In the code below, we split the data into 5 parts, twice. We are therefore left with 10 train/validation partitions.
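The chunk that creates these folds is omitted here; a minimal sketch that matches the output below, using rsample's vfold_cv() stratified on the outcome (the seed value is an assumption):
set.seed(123) # assumed seed, only for reproducibility
folds <- vfold_cv(
dado_treino,
v = 5,
repeats = 2,
strata = i_liked_partner
)
folds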
## # 5-fold cross-validation repeated 2 times using stratification
## # A tibble: 10 x 3
## splits id id2
## <list> <chr> <chr>
## 1 <split [2.9K/734]> Repeat1 Fold1
## 2 <split [2.9K/734]> Repeat1 Fold2
## 3 <split [2.9K/732]> Repeat1 Fold3
## 4 <split [2.9K/732]> Repeat1 Fold4
## 5 <split [2.9K/732]> Repeat1 Fold5
## 6 <split [2.9K/734]> Repeat2 Fold1
## 7 <split [2.9K/734]> Repeat2 Fold2
## 8 <split [2.9K/732]> Repeat2 Fold3
## 9 <split [2.9K/732]> Repeat2 Fold4
## 10 <split [2.9K/732]> Repeat2 Fold5
## Rows: 2,930
## Columns: 64
## $ sex <fct> Mulher, Mulher, Mulher, ...
## $ choice <fct> limited, limited, limite...
## $ partner_race <fct> Asian, White, White, Whi...
## $ frequency_date <ord> Once a month, Once a mon...
## $ career_macro <fct> Lawyer, Lawyer, Lawyer, ...
## $ opinion_duration_of_date <fct> Just Right, Just Right, ...
## $ race <fct> White, White, White, Whi...
## $ goal <fct> Fun, Fun, Fun, Fun, Fun,...
## $ match <dbl> 0, 1, 0, 0, 0, 1, 0, 0, ...
## $ same_race <dbl> 0, 1, 1, 1, 1, 1, 1, 1, ...
## $ partner_liked_me <dbl> 1, 1, 1, 1, 0, 1, 0, 0, ...
## $ i_liked_partner <fct> Not, Liked, Not, Not, No...
## $ met_before <dbl> 0, 1, 0, 0, 0, 0, 0, 0, ...
## $ unique_id_number <dbl> 4, 4, 4, 4, 4, 4, 4, 5, ...
## $ id_within_wave <dbl> 4, 4, 4, 4, 4, 4, 4, 5, ...
## $ subject_within_gender <dbl> 7, 7, 7, 7, 7, 7, 7, 9, ...
## $ n_people_met_in_wave <dbl> 10, 10, 10, 10, 10, 10, ...
## $ position_meeting <dbl> 6, 6, 6, 6, 6, 6, 6, 4, ...
## $ order_meeting <dbl> 9, 4, 5, 10, 1, 7, 8, 1,...
## $ partnet_id_within_wave <dbl> 3, 4, 6, 7, 8, 9, 10, 1,...
## $ partner_unique_id_number <dbl> 13, 14, 16, 17, 18, 19, ...
## $ interests_correlation <dbl> 0.05, -0.18, 0.37, 0.35,...
## $ my_age <dbl> 23, 23, 23, 23, 23, 23, ...
## $ partner_age <dbl> 22, 23, 25, 30, 27, 28, ...
## $ partner_stated_pref_time0_attractive <dbl> 0.1900000, 0.3000000, 0....
## $ partner_stated_pref_time0_sincere <dbl> 0.1800000, 0.0500000, 0....
## $ partner_stated_pref_time0_intelligent <dbl> 0.1900000, 0.1500000, 0....
## $ partner_stated_pref_time0_fun <dbl> 0.1800000, 0.4000000, 0....
## $ partner_stated_pref_time0_ambitious <dbl> 0.1400000, 0.0500000, 0....
## $ partner_stated_pref_time0_shared_interests <dbl> 0.1200000, 0.0500000, 0....
## $ importance_same_race <dbl> 1, 1, 1, 1, 1, 1, 1, 8, ...
## $ importance_same_religion <dbl> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ happy_expec <dbl> 1, 1, 1, 1, 1, 1, 1, 7, ...
## $ i_found_partner__attractive <dbl> 4, 8, 5, 7, 5, 9, 8, 5, ...
## $ i_found_partner__sincere <dbl> 7, 10, 10, 10, 9, 8, 9, ...
## $ i_found_partner__intelligent <dbl> 8, 7, 8, 10, 9, 10, 10, ...
## $ i_found_partner__fun <dbl> 8, 10, 4, 7, 5, 10, 10, ...
## $ i_found_partner__ambitious <dbl> 6, 7, 8, 10, 9, 7, 8, 2,...
## $ i_found_partner__interests <dbl> 7, 10, 2, 5, 7, 8, 8, 2,...
## $ partner_found_me__attractive <dbl> 10, 7, 6, 7, 6, 7, 7, 6,...
## $ partner_found_me__sincere <dbl> 10, 7, 6, 6, 7, 7, 8, 8,...
## $ partner_found_me__intelligent <dbl> 10, 7, 7, 3, 8, 7, 8, 8,...
## $ partner_found_me__fun <dbl> 10, 9, 7, 5, 6, 10, 7, 8...
## $ partner_found_me__ambitious <dbl> 10, 9, 8, 6, 6, 9, 8, 7,...
## $ partner_found_me__interests <dbl> 10, 9, 7, 5, 5, 10, 7, 6...
## $ probability_i_find_partner_liked_me <dbl> 1, 10, 3, 1, 6, 8, 8, 5,...
## $ you_look_for__attractive <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__sincere <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__intelligent <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__fun <dbl> 0.20, 0.20, 0.20, 0.20, ...
## $ you_look_for__ambitious <dbl> 0.10, 0.10, 0.10, 0.10, ...
## $ you_look_for__shared_interests <dbl> 0.10, 0.10, 0.10, 0.10, ...
## $ you_perceive_yourself__attractive <dbl> 7, 7, 7, 7, 7, 7, 7, 6, ...
## $ you_perceive_yourself__sincere <dbl> 8, 8, 8, 8, 8, 8, 8, 3, ...
## $ you_perceive_yourself__fun <dbl> 9, 9, 9, 9, 9, 9, 9, 6, ...
## $ you_perceive_yourself__intelligent <dbl> 7, 7, 7, 7, 7, 7, 7, 10,...
## $ you_perceive_yourself__ambitious <dbl> 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ probability_partner_find_i_liked_partner <dbl> 10, 10, 6, 4, 7, 8, 7, 5...
## $ partner_career_macro <fct> Lawyer, Lawyer, Banking/...
## $ partner_perceives_himself__attractive <dbl> 4, 9, 6, 7, 6, 10, 7, 8,...
## $ partner_perceives_himself__sincere <dbl> 7, 9, 6, 7, 8, 6, 7, 9, ...
## $ partner_perceives_himself__fun <dbl> 8, 9, 8, 6, 6, 10, 10, 7...
## $ partner_perceives_himself__intelligent <dbl> 8, 9, 8, 8, 8, 10, 10, 8...
## $ partner_perceives_himself__ambitious <dbl> 3, 9, 6, 4, 9, 10, 10, 5...
Running cross-validation, as we have seen, means estimating the model several times in completely independent runs, which is perfect for parallel execution.
The fit_resamples() function performs all the estimations.
It is possible to register a backend to run them in parallel:
all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
log_mod <-
logistic_reg() %>%
set_engine("glm")
wf_com_interacao_class <- workflow() %>%
add_recipe(receita_com_interacao_class) %>%
add_model(log_mod)
fit_com_interacao_resample <-
wf_com_interacao_class %>%
fit_resamples(
folds,
control = control_resamples(
allow_par = TRUE
)
)
stopCluster(cl)
fit_com_interacao_resample
## # Resampling results
## # 5-fold cross-validation repeated 2 times using stratification
## # A tibble: 10 x 5
## splits id id2 .metrics .notes
## <list> <chr> <chr> <list> <list>
## 1 <split [2.9K/734]> Repeat1 Fold1 <tibble [2 x 4]> <tibble [3 x 1]>
## 2 <split [2.9K/734]> Repeat1 Fold2 <tibble [2 x 4]> <tibble [2 x 1]>
## 3 <split [2.9K/732]> Repeat1 Fold3 <tibble [2 x 4]> <tibble [3 x 1]>
## 4 <split [2.9K/732]> Repeat1 Fold4 <tibble [2 x 4]> <tibble [2 x 1]>
## 5 <split [2.9K/732]> Repeat1 Fold5 <tibble [2 x 4]> <tibble [3 x 1]>
## 6 <split [2.9K/734]> Repeat2 Fold1 <tibble [2 x 4]> <tibble [3 x 1]>
## 7 <split [2.9K/734]> Repeat2 Fold2 <tibble [2 x 4]> <tibble [3 x 1]>
## 8 <split [2.9K/732]> Repeat2 Fold3 <tibble [2 x 4]> <tibble [2 x 1]>
## 9 <split [2.9K/732]> Repeat2 Fold4 <tibble [2 x 4]> <tibble [3 x 1]>
## 10 <split [2.9K/732]> Repeat2 Fold5 <tibble [2 x 4]> <tibble [3 x 1]>
The collect_metrics() function takes the object returned by fit_resamples() and returns a structure with the results of the runs, 10 in our case. Since there are no hyperparameters to vary, there are only two rows, with two metrics for the same single hyperparameter configuration.
collect_metrics(fit_com_interacao_resample) %>%
gt() %>%
fmt_percent(
columns = vars(mean, std_err)
)
| .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|
| accuracy | binary | 76.69% | 10 | 0.28% | Preprocessor1_Model1 |
| roc_auc | binary | 84.66% | 10 | 0.33% | Preprocessor1_Model1 |
The logistic regression model we ran earlier had no hyperparameters, but when running models such as Elastic Net (lasso-ridge) we would like to evaluate, on the validation data, which hyperparameter configuration is best.
The tune package helps us with this procedure.
When defining the model, we assign the result of tune() to the model parameters we want to vary for tuning purposes:
tune_spec_logistic_reg <- logistic_reg(
penalty = tune(),
mixture = tune()
) %>%
set_engine(
engine = "glmnet"
) %>%
set_mode(
"classification"
)
tune_spec_logistic_reg
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = tune()
##
## Computational engine: glmnet
One way to carry out this search for the best hyperparameter configuration is to create a grid with several specifications.
The dials package helps create this grid of hyperparameter values.
The grid_regular() function creates a grid with regular intervals (depending on the parameter, these intervals may be regular on a log scale, for example).
There are functions with the same names as the parameters in parsnip's generic interface. When called without arguments, these functions generate values that usually make sense, but it is also possible to choose the values in a customized way.
The levels argument defines how many different values will be used for each parameter.
net_grid <- grid_regular(
penalty(range = c(-4,-1)),
mixture(),
levels = c(
penalty = 10,
mixture = 10
)
)
net_grid
## # A tibble: 100 x 2
## penalty mixture
## <dbl> <dbl>
## 1 0.0001 0
## 2 0.000215 0
## 3 0.000464 0
## 4 0.001 0
## 5 0.00215 0
## 6 0.00464 0
## 7 0.01 0
## 8 0.0215 0
## 9 0.0464 0
## 10 0.1 0
## # ... with 90 more rows
The tune_grid() function runs the search over this grid of parameter values, running the cross-validation process according to what is passed to the resamples argument.
all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
wf_logreg_tune_sample <- workflow() %>%
add_model(
tune_spec_logistic_reg
) %>%
add_recipe(receita_com_interacao_class)
res_logreg_tune_sample_optim <- wf_logreg_tune_sample %>%
tune_grid(
resamples = folds,
grid = net_grid,
control = control_grid(allow_par = TRUE)
)
stopCluster(cl)
res_logreg_tune_sample_optim
## # Tuning results
## # 5-fold cross-validation repeated 2 times using stratification
## # A tibble: 10 x 5
## splits id id2 .metrics .notes
## <list> <chr> <chr> <list> <list>
## 1 <split [2.9K/734]> Repeat1 Fold1 <tibble [200 x 6]> <tibble [1 x 1]>
## 2 <split [2.9K/734]> Repeat1 Fold2 <tibble [200 x 6]> <tibble [1 x 1]>
## 3 <split [2.9K/732]> Repeat1 Fold3 <tibble [200 x 6]> <tibble [1 x 1]>
## 4 <split [2.9K/732]> Repeat1 Fold4 <tibble [200 x 6]> <tibble [1 x 1]>
## 5 <split [2.9K/732]> Repeat1 Fold5 <tibble [200 x 6]> <tibble [1 x 1]>
## 6 <split [2.9K/734]> Repeat2 Fold1 <tibble [200 x 6]> <tibble [1 x 1]>
## 7 <split [2.9K/734]> Repeat2 Fold2 <tibble [200 x 6]> <tibble [1 x 1]>
## 8 <split [2.9K/732]> Repeat2 Fold3 <tibble [200 x 6]> <tibble [1 x 1]>
## 9 <split [2.9K/732]> Repeat2 Fold4 <tibble [200 x 6]> <tibble [1 x 1]>
## 10 <split [2.9K/732]> Repeat2 Fold5 <tibble [200 x 6]> <tibble [1 x 1]>
One way to visualize the results is with geom_tile, after running collect_metrics():
plot_result_tune <- function(results){
results %>%
collect_metrics() %>%
filter(
.metric == "roc_auc"
) %>%
mutate(
ranque = rank(mean)
) %>%
ggplot() +
geom_tile(
aes(
x = penalty,
y = mixture,
fill = ranque
)
) +
geom_shadowtext(
aes(
x = penalty,
y = mixture,
label = percent(mean, accuracy = .01),
),
size = 3,
color = "white",
bg.colour="black",
) +
scale_x_log10() +
scale_fill_gradient(low = "white", high = "darkgreen") +
theme_minimal() +
theme(
legend.position = "top"
)
}
plot_result_tune(res_logreg_tune_sample_optim)
res_logreg_tune_sample_optim %>%
collect_metrics() %>%
filter(
.metric == "roc_auc"
) %>%
arrange(mean %>% desc()) %>%
gt() %>%
fmt_percent(
columns = vars(mean, std_err)
) %>%
fmt_number(
columns = vars(penalty, mixture),
n_sigfig = 2
)
| penalty | mixture | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|
| 0.010 | 0.33 | roc_auc | binary | 85.32% | 10 | 0.30% | Preprocessor1_Model037 |
| 0.0046 | 0.67 | roc_auc | binary | 85.30% | 10 | 0.31% | Preprocessor1_Model066 |
| 0.0046 | 0.78 | roc_auc | binary | 85.30% | 10 | 0.31% | Preprocessor1_Model076 |
| 0.010 | 0.44 | roc_auc | binary | 85.29% | 10 | 0.29% | Preprocessor1_Model047 |
| 0.010 | 0.22 | roc_auc | binary | 85.29% | 10 | 0.31% | Preprocessor1_Model027 |
| 0.0046 | 0.56 | roc_auc | binary | 85.29% | 10 | 0.32% | Preprocessor1_Model056 |
| 0.0046 | 0.89 | roc_auc | binary | 85.28% | 10 | 0.30% | Preprocessor1_Model086 |
| 0.0022 | 1.0 | roc_auc | binary | 85.28% | 10 | 0.33% | Preprocessor1_Model095 |
| 0.0046 | 0.44 | roc_auc | binary | 85.26% | 10 | 0.32% | Preprocessor1_Model046 |
| 0.0046 | 1.0 | roc_auc | binary | 85.25% | 10 | 0.30% | Preprocessor1_Model096 |
| 0.0022 | 0.89 | roc_auc | binary | 85.25% | 10 | 0.33% | Preprocessor1_Model085 |
| 0.022 | 0.11 | roc_auc | binary | 85.24% | 10 | 0.29% | Preprocessor1_Model018 |
| 0.0022 | 0.78 | roc_auc | binary | 85.23% | 10 | 0.33% | Preprocessor1_Model075 |
| 0.022 | 0.22 | roc_auc | binary | 85.21% | 10 | 0.28% | Preprocessor1_Model028 |
| 0.010 | 0.56 | roc_auc | binary | 85.21% | 10 | 0.28% | Preprocessor1_Model057 |
| 0.0046 | 0.33 | roc_auc | binary | 85.21% | 10 | 0.32% | Preprocessor1_Model036 |
| 0.0022 | 0.67 | roc_auc | binary | 85.19% | 10 | 0.33% | Preprocessor1_Model065 |
| 0.010 | 0.11 | roc_auc | binary | 85.18% | 10 | 0.31% | Preprocessor1_Model017 |
| 0.0010 | 1.0 | roc_auc | binary | 85.16% | 10 | 0.33% | Preprocessor1_Model094 |
| 0.0022 | 0.56 | roc_auc | binary | 85.15% | 10 | 0.33% | Preprocessor1_Model055 |
| 0.0010 | 0.89 | roc_auc | binary | 85.14% | 10 | 0.32% | Preprocessor1_Model084 |
| 0.0046 | 0.22 | roc_auc | binary | 85.13% | 10 | 0.32% | Preprocessor1_Model026 |
| 0.0010 | 0.78 | roc_auc | binary | 85.13% | 10 | 0.32% | Preprocessor1_Model074 |
| 0.010 | 0.67 | roc_auc | binary | 85.12% | 10 | 0.28% | Preprocessor1_Model067 |
| 0.0022 | 0.44 | roc_auc | binary | 85.12% | 10 | 0.32% | Preprocessor1_Model045 |
| 0.0010 | 0.67 | roc_auc | binary | 85.11% | 10 | 0.31% | Preprocessor1_Model064 |
| 0.0022 | 0.33 | roc_auc | binary | 85.09% | 10 | 0.32% | Preprocessor1_Model035 |
| 0.0010 | 0.56 | roc_auc | binary | 85.09% | 10 | 0.31% | Preprocessor1_Model054 |
| 0.0046 | 0.11 | roc_auc | binary | 85.07% | 10 | 0.32% | Preprocessor1_Model016 |
| 0.00010 | 0 | roc_auc | binary | 85.07% | 10 | 0.30% | Preprocessor1_Model001 |
| 0.00022 | 0 | roc_auc | binary | 85.07% | 10 | 0.30% | Preprocessor1_Model002 |
| 0.00046 | 0 | roc_auc | binary | 85.07% | 10 | 0.30% | Preprocessor1_Model003 |
| 0.0010 | 0 | roc_auc | binary | 85.07% | 10 | 0.30% | Preprocessor1_Model004 |
| 0.0022 | 0 | roc_auc | binary | 85.07% | 10 | 0.30% | Preprocessor1_Model005 |
| 0.0046 | 0 | roc_auc | binary | 85.07% | 10 | 0.30% | Preprocessor1_Model006 |
| 0.010 | 0 | roc_auc | binary | 85.07% | 10 | 0.30% | Preprocessor1_Model007 |
| 0.022 | 0 | roc_auc | binary | 85.07% | 10 | 0.30% | Preprocessor1_Model008 |
| 0.0022 | 0.22 | roc_auc | binary | 85.06% | 10 | 0.31% | Preprocessor1_Model025 |
| 0.0010 | 0.44 | roc_auc | binary | 85.06% | 10 | 0.31% | Preprocessor1_Model044 |
| 0.046 | 0.11 | roc_auc | binary | 85.04% | 10 | 0.26% | Preprocessor1_Model019 |
| 0.022 | 0.33 | roc_auc | binary | 85.04% | 10 | 0.26% | Preprocessor1_Model038 |
| 0.046 | 0 | roc_auc | binary | 85.04% | 10 | 0.29% | Preprocessor1_Model009 |
| 0.00046 | 1.0 | roc_auc | binary | 85.03% | 10 | 0.30% | Preprocessor1_Model093 |
| 0.0010 | 0.33 | roc_auc | binary | 85.02% | 10 | 0.30% | Preprocessor1_Model034 |
| 0.010 | 0.78 | roc_auc | binary | 85.01% | 10 | 0.27% | Preprocessor1_Model077 |
| 0.0022 | 0.11 | roc_auc | binary | 85.01% | 10 | 0.31% | Preprocessor1_Model015 |
| 0.00046 | 0.89 | roc_auc | binary | 85.01% | 10 | 0.30% | Preprocessor1_Model083 |
| 0.00046 | 0.78 | roc_auc | binary | 85.00% | 10 | 0.30% | Preprocessor1_Model073 |
| 0.0010 | 0.22 | roc_auc | binary | 84.99% | 10 | 0.30% | Preprocessor1_Model024 |
| 0.00046 | 0.67 | roc_auc | binary | 84.98% | 10 | 0.30% | Preprocessor1_Model063 |
| 0.00046 | 0.56 | roc_auc | binary | 84.96% | 10 | 0.30% | Preprocessor1_Model053 |
| 0.00046 | 0.44 | roc_auc | binary | 84.94% | 10 | 0.30% | Preprocessor1_Model043 |
| 0.0010 | 0.11 | roc_auc | binary | 84.93% | 10 | 0.31% | Preprocessor1_Model014 |
| 0.00022 | 1.0 | roc_auc | binary | 84.91% | 10 | 0.30% | Preprocessor1_Model092 |
| 0.00046 | 0.33 | roc_auc | binary | 84.91% | 10 | 0.30% | Preprocessor1_Model033 |
| 0.00022 | 0.89 | roc_auc | binary | 84.90% | 10 | 0.30% | Preprocessor1_Model082 |
| 0.010 | 0.89 | roc_auc | binary | 84.90% | 10 | 0.26% | Preprocessor1_Model087 |
| 0.00022 | 0.78 | roc_auc | binary | 84.89% | 10 | 0.30% | Preprocessor1_Model072 |
| 0.00046 | 0.22 | roc_auc | binary | 84.88% | 10 | 0.31% | Preprocessor1_Model023 |
| 0.00022 | 0.67 | roc_auc | binary | 84.87% | 10 | 0.30% | Preprocessor1_Model062 |
| 0.00022 | 0.56 | roc_auc | binary | 84.85% | 10 | 0.30% | Preprocessor1_Model052 |
| 0.00046 | 0.11 | roc_auc | binary | 84.85% | 10 | 0.31% | Preprocessor1_Model013 |
| 0.10 | 0 | roc_auc | binary | 84.85% | 10 | 0.28% | Preprocessor1_Model010 |
| 0.00010 | 1.0 | roc_auc | binary | 84.84% | 10 | 0.31% | Preprocessor1_Model091 |
| 0.00022 | 0.44 | roc_auc | binary | 84.84% | 10 | 0.30% | Preprocessor1_Model042 |
| 0.00022 | 0.33 | roc_auc | binary | 84.83% | 10 | 0.30% | Preprocessor1_Model032 |
| 0.022 | 0.44 | roc_auc | binary | 84.82% | 10 | 0.25% | Preprocessor1_Model048 |
| 0.00022 | 0.22 | roc_auc | binary | 84.81% | 10 | 0.31% | Preprocessor1_Model022 |
| 0.00010 | 0.89 | roc_auc | binary | 84.81% | 10 | 0.30% | Preprocessor1_Model081 |
| 0.00010 | 0.78 | roc_auc | binary | 84.81% | 10 | 0.30% | Preprocessor1_Model071 |
| 0.00010 | 0.11 | roc_auc | binary | 84.80% | 10 | 0.31% | Preprocessor1_Model011 |
| 0.00022 | 0.11 | roc_auc | binary | 84.80% | 10 | 0.31% | Preprocessor1_Model012 |
| 0.00010 | 0.67 | roc_auc | binary | 84.79% | 10 | 0.31% | Preprocessor1_Model061 |
| 0.00010 | 0.56 | roc_auc | binary | 84.79% | 10 | 0.31% | Preprocessor1_Model051 |
| 0.00010 | 0.44 | roc_auc | binary | 84.78% | 10 | 0.31% | Preprocessor1_Model041 |
| 0.010 | 1.0 | roc_auc | binary | 84.78% | 10 | 0.26% | Preprocessor1_Model097 |
| 0.00010 | 0.33 | roc_auc | binary | 84.77% | 10 | 0.31% | Preprocessor1_Model031 |
| 0.00010 | 0.22 | roc_auc | binary | 84.77% | 10 | 0.31% | Preprocessor1_Model021 |
| 0.046 | 0.22 | roc_auc | binary | 84.61% | 10 | 0.24% | Preprocessor1_Model029 |
| 0.022 | 0.56 | roc_auc | binary | 84.56% | 10 | 0.23% | Preprocessor1_Model058 |
| 0.10 | 0.11 | roc_auc | binary | 84.33% | 10 | 0.23% | Preprocessor1_Model020 |
| 0.022 | 0.67 | roc_auc | binary | 84.33% | 10 | 0.22% | Preprocessor1_Model068 |
| 0.046 | 0.33 | roc_auc | binary | 84.20% | 10 | 0.21% | Preprocessor1_Model039 |
| 0.022 | 0.78 | roc_auc | binary | 84.10% | 10 | 0.21% | Preprocessor1_Model078 |
| 0.022 | 0.89 | roc_auc | binary | 83.85% | 10 | 0.20% | Preprocessor1_Model088 |
| 0.046 | 0.44 | roc_auc | binary | 83.80% | 10 | 0.19% | Preprocessor1_Model049 |
| 0.022 | 1.0 | roc_auc | binary | 83.61% | 10 | 0.19% | Preprocessor1_Model098 |
| 0.10 | 0.22 | roc_auc | binary | 83.58% | 10 | 0.21% | Preprocessor1_Model030 |
| 0.046 | 0.56 | roc_auc | binary | 83.41% | 10 | 0.18% | Preprocessor1_Model059 |
| 0.046 | 0.67 | roc_auc | binary | 83.05% | 10 | 0.17% | Preprocessor1_Model069 |
| 0.10 | 0.33 | roc_auc | binary | 82.99% | 10 | 0.19% | Preprocessor1_Model040 |
| 0.046 | 0.78 | roc_auc | binary | 82.67% | 10 | 0.17% | Preprocessor1_Model079 |
| 0.10 | 0.44 | roc_auc | binary | 82.57% | 10 | 0.18% | Preprocessor1_Model050 |
| 0.046 | 0.89 | roc_auc | binary | 82.32% | 10 | 0.18% | Preprocessor1_Model089 |
| 0.10 | 0.56 | roc_auc | binary | 82.22% | 10 | 0.19% | Preprocessor1_Model060 |
| 0.046 | 1.0 | roc_auc | binary | 82.13% | 10 | 0.19% | Preprocessor1_Model099 |
| 0.10 | 0.67 | roc_auc | binary | 81.83% | 10 | 0.22% | Preprocessor1_Model070 |
| 0.10 | 0.78 | roc_auc | binary | 81.57% | 10 | 0.23% | Preprocessor1_Model080 |
| 0.10 | 0.89 | roc_auc | binary | 81.29% | 10 | 0.24% | Preprocessor1_Model090 |
| 0.10 | 1.0 | roc_auc | binary | 80.90% | 10 | 0.23% | Preprocessor1_Model100 |
A decision tree partitions the space formed by the explanatory variables into subspaces, based on the "purity" of those subspaces with respect to the dependent variable.
Below we run an experiment with only 2 continuous features and one categorical feature with two possible values.
receita_arvore_decisao_demo <- recipe(
i_liked_partner ~
i_found_partner__attractive +
sex +
my_age,
data = dado_treino
)
arvore_decisao_mod <-
decision_tree(
tree_depth = 4,
min_n = 1,
cost_complexity = 0
) %>%
set_engine("rpart") %>%
set_mode("classification")
wf_arvore_decisao_demo <- workflow() %>%
add_recipe(receita_arvore_decisao_demo) %>%
add_model(arvore_decisao_mod)
fit_ad_demo <-
wf_arvore_decisao_demo %>%
fit(
data = dado_treino
)
The algorithm creates a decision tree like this one.
It therefore chooses the order and the values of the attributes that will slice the population. Following the tree down to its leaves, the nodes without children, we can determine the predicted output for each value of the vector x.
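The fitted tree itself is not reproduced in this excerpt; a minimal sketch to draw it, assuming the rpart.plot package is installed:
library(rpart.plot)
# pull the parsnip fit out of the workflow; its $fit slot is the underlying rpart object
arvore_rpart <- pull_workflow_fit(fit_ad_demo)$fit
rpart.plot(arvore_rpart, roundint = FALSE)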
We have a space formed by two continuous features and one categorical feature with two categories; we can think of this space as two planes.
The decision tree algorithm chooses which feature splits each plane into two parts that are purer in the classification sense (with more points of the same category).
Since we have two continuous features plus one with two possible categories, this is equivalent to dividing the space represented by the two planes with vertical and horizontal lines.
valores_partner_atractive = tibble(i_found_partner__attractive = seq(0, 10, by = 0.1))
valores_my_age =
tibble(
my_age = seq(min(dado_treino$my_age), max(dado_treino$my_age), by = 1 )
)
valores_sex = tibble( sex = c("Homem", "Mulher") )
dados_novos <- crossing(
valores_partner_atractive,
valores_my_age,
valores_sex
)
predicoes_arvore <- predict(
object = fit_ad_demo,
new_data = dados_novos
)
predicoes_arvore_com_dados <- bind_cols(
dados_novos,
predicoes_arvore
)
ggplot(predicoes_arvore_com_dados) +
geom_tile(
aes(
x = i_found_partner__attractive,
y = my_age,
fill = .pred_class,
alpha = 0.1
)
) +
geom_jitter(
data = dado_treino,
width = 0.5,
aes(
x = i_found_partner__attractive,
y = my_age,
color = i_liked_partner
),
size = 0.8,
show.legend = FALSE
) +
geom_vline(
xintercept = 6.75
) +
geom_vline(
xintercept = 7.25
) +
geom_hline(
data = tibble(
my_age = 20.5,
sex = "Homem"
),
aes(
yintercept = my_age
)
) +
geom_hline(
data = tibble(
my_age = 28.5,
sex = "Mulher"
),
aes(
yintercept = my_age
)
) +
facet_wrap(
~sex
) +
guides(
alpha = FALSE
) +
theme_minimal() +
scale_fill_manual(
values = c(Liked = "darkgreen", Not = "darkred")
) +
scale_color_manual(
values = c(Liked = "darkgreen", Not = "darkred")
) +
theme(
legend.position = "top"
) +
labs(
fill = "",
x = "Achei bonito",
y = "Minha idade"
)
log_reg_mod <-
logistic_reg(
) %>%
set_engine("glm") %>%
set_mode("classification")
wf_log_reg_demo <- workflow() %>%
add_recipe(receita_arvore_decisao_demo) %>%
add_model(log_reg_mod)
fit_log_reg_demo <-
wf_log_reg_demo %>%
fit(
data = dado_treino
)
For comparison, we can look at the same scheme with logistic regression:
predicoes_log_reg <- predict(
object = fit_log_reg_demo,
new_data = dados_novos
)
predicoes_log_reg_com_dados <- bind_cols(
dados_novos,
predicoes_log_reg
)
ggplot(predicoes_log_reg_com_dados) +
geom_tile(
aes(
x = i_found_partner__attractive,
y = my_age,
fill = .pred_class,
alpha = 0.1
)
) +
geom_jitter(
data = dado_treino,
width = 0.5,
aes(
x = i_found_partner__attractive,
y = my_age,
color = i_liked_partner
),
size = 0.8,
show.legend = FALSE
) +
facet_wrap(
~sex
) +
guides(
alpha = FALSE
) +
theme_minimal() +
scale_fill_manual(
values = c(Liked = "darkgreen", Not = "darkred")
) +
scale_color_manual(
values = c(Liked = "darkgreen", Not = "darkred")
) +
theme(
legend.position = "top"
) +
labs(
fill = "",
x = "Achei bonito",
y = "Minha idade"
)
The way a decision tree is built gives it a lot of variance.
Each partitioning decision is made from characteristics that can be very specific to the training data.
One idea used in Random Forests is to create \(trees\) training sets from the original set by drawing samples of the same size as the original set, with replacement. In addition, each time a tree is split, the split can only happen on \(mtry\) of the explanatory variables.
The final result is an average of the decisions of these \(trees\) trees.
These two changes give the model much lower variance than a single decision tree.
Below we can see that the model offers much more flexibility than a single decision tree.
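The fitted object fit_ranger_demo used below is not created in the excerpt shown; a minimal sketch that would produce it, reusing the demo recipe and assuming the ranger engine with its default hyperparameters:
ranger_demo_mod <-
rand_forest() %>%
set_engine("ranger") %>%
set_mode("classification")
wf_ranger_demo <- workflow() %>%
add_recipe(receita_arvore_decisao_demo) %>%
add_model(ranger_demo_mod)
fit_ranger_demo <-
wf_ranger_demo %>%
fit(
data = dado_treino
)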
predicoes_ranger <- predict(
object = fit_ranger_demo,
new_data = dados_novos
)
predicoes_ranger_com_dados <- bind_cols(
dados_novos,
predicoes_ranger
)
ggplot(predicoes_ranger_com_dados) +
geom_tile(
aes(
x = i_found_partner__attractive,
y = my_age,
fill = .pred_class,
alpha = 0.1
)
) +
geom_jitter(
data = dado_treino,
width = 0.5,
aes(
x = i_found_partner__attractive,
y = my_age,
color = i_liked_partner
),
size = 0.8,
show.legend = FALSE
) +
facet_wrap(
~sex
) +
guides(
alpha = FALSE
) +
theme_minimal() +
scale_fill_manual(
values = c(Liked = "darkgreen", Not = "darkred")
) +
scale_color_manual(
values = c(Liked = "darkgreen", Not = "darkred")
) +
theme(
legend.position = "top"
) +
labs(
fill = "",
x = "Achei bonito",
y = "Minha idade"
)
Below we set up the new model configuration, with the hyperparameters to be tuned.
tune_spec_rand_forest <- rand_forest(
mtry = tune(),
trees = tune(),
min_n = tune()
) %>%
set_engine(
engine = "ranger"
) %>%
set_mode(
"classification"
)
tune_spec_rand_forest
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = tune()
## min_n = tune()
##
## Computational engine: ranger
Let's create a first grid to try to optimize the hyperparameter values:
rand_for_grid <- grid_regular(
mtry(range = c(3,50)),
trees(range = c(1,100)),
min_n(),
levels = 4
)
rand_for_grid
## # A tibble: 64 x 3
## mtry trees min_n
## <int> <int> <int>
## 1 3 1 2
## 2 18 1 2
## 3 34 1 2
## 4 50 1 2
## 5 3 34 2
## 6 18 34 2
## 7 34 34 2
## 8 50 34 2
## 9 3 67 2
## 10 18 67 2
## # ... with 54 more rows
wf_rand_for_tune_sample <- workflow() %>%
add_model(
tune_spec_rand_forest
) %>%
add_recipe(receita_com_interacao_class)
all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
res_rand_for <- wf_rand_for_tune_sample %>%
tune_grid(
resamples = folds,
grid = rand_for_grid,
control = control_grid(allow_par = TRUE, verbose = TRUE)
)
stopCluster(cl)
res_rand_for <- read_rds("resultados/res_rand_for.rds")
collect_metrics(res_rand_for) %>% arrange(desc(mean)) %>%
filter(
.metric == "roc_auc"
) %>%
head(
n = 10
) %>%
gt() %>%
fmt_percent(
columns = vars(mean, std_err)
)
| mtry | trees | min_n | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|---|
| 18 | 100 | 2 | roc_auc | binary | 88.03% | 10 | 0.48% | Preprocessor1_Model14 |
| 50 | 100 | 2 | roc_auc | binary | 87.72% | 10 | 0.49% | Preprocessor1_Model16 |
| 34 | 100 | 2 | roc_auc | binary | 87.67% | 10 | 0.46% | Preprocessor1_Model15 |
| 18 | 67 | 2 | roc_auc | binary | 87.67% | 10 | 0.47% | Preprocessor1_Model10 |
| 18 | 100 | 14 | roc_auc | binary | 87.60% | 10 | 0.50% | Preprocessor1_Model30 |
| 50 | 67 | 2 | roc_auc | binary | 87.44% | 10 | 0.49% | Preprocessor1_Model12 |
| 34 | 100 | 14 | roc_auc | binary | 87.43% | 10 | 0.50% | Preprocessor1_Model31 |
| 34 | 67 | 2 | roc_auc | binary | 87.43% | 10 | 0.44% | Preprocessor1_Model11 |
| 18 | 67 | 14 | roc_auc | binary | 87.41% | 10 | 0.47% | Preprocessor1_Model26 |
| 50 | 100 | 14 | roc_auc | binary | 87.34% | 10 | 0.43% | Preprocessor1_Model32 |
plot_result_tune_ranger <- function(results){
results %>%
map_df(
.f = collect_metrics
) %>%
filter(
.metric == "roc_auc"
) %>%
mutate(
ranque = rank(mean)
) %>%
ggplot() +
geom_tile(
aes(
x = mtry %>% factor(),
y = trees %>% factor(),
fill = ranque
)
) +
geom_shadowtext(
aes(
x = mtry %>% factor(),
y = trees %>% factor(),
label = percent(mean, accuracy = .01),
),
color = "white",
bg.colour="black",
size = 3,
) +
scale_fill_gradient(low = "white", high = "darkgreen") +
facet_wrap(
~min_n,
ncol = 1,
labeller = as_labeller( function(x){str_glue("min_n: {x}")} )
) +
theme_minimal() +
theme(
legend.position = "top"
) +
labs(
x = "mtry",
y = "trees"
)
}
plot_result_tune_ranger(list(res_rand_for))
rand_for_grid_optim <- grid_regular(
mtry(range = c(4,45)),
trees(range = c(150,350)),
min_n(c(1,6)),
levels = 4
)
all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
res_rand_for_optim <- wf_rand_for_tune_sample %>%
tune_grid(
resamples = folds,
grid = rand_for_grid_optim,
control = control_grid(allow_par = TRUE, verbose = TRUE)
)
stopCluster(cl)
res_rand_for_optim <- read_rds("resultados/res_rand_for_optim.rds")
collect_metrics(res_rand_for_optim) %>% arrange(desc(mean)) %>%
filter(
.metric == "roc_auc"
) %>%
head(
n = 10
) %>%
gt() %>%
fmt_percent(
columns = vars(mean, std_err)
)
| mtry | trees | min_n | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|---|
| 17 | 283 | 1 | roc_auc | binary | 88.29% | 10 | 0.46% | Preprocessor1_Model10 |
| 17 | 350 | 2 | roc_auc | binary | 88.19% | 10 | 0.45% | Preprocessor1_Model30 |
| 17 | 350 | 1 | roc_auc | binary | 88.18% | 10 | 0.44% | Preprocessor1_Model14 |
| 17 | 216 | 2 | roc_auc | binary | 88.18% | 10 | 0.45% | Preprocessor1_Model22 |
| 17 | 216 | 1 | roc_auc | binary | 88.13% | 10 | 0.47% | Preprocessor1_Model06 |
| 17 | 350 | 4 | roc_auc | binary | 88.12% | 10 | 0.46% | Preprocessor1_Model46 |
| 17 | 283 | 2 | roc_auc | binary | 88.12% | 10 | 0.44% | Preprocessor1_Model26 |
| 17 | 350 | 6 | roc_auc | binary | 88.10% | 10 | 0.43% | Preprocessor1_Model62 |
| 17 | 283 | 4 | roc_auc | binary | 88.09% | 10 | 0.46% | Preprocessor1_Model42 |
| 31 | 283 | 4 | roc_auc | binary | 88.07% | 10 | 0.45% | Preprocessor1_Model43 |
rand_for_grid_optim_2 <- grid_regular(
mtry(range = c(10,35)),
trees(range = c(350,500)),
min_n(range = c(1, 3)),
levels = c(mtry = 3, trees = 4, min_n = 3)
)
all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
res_rand_for_optim_2 <- wf_rand_for_tune_sample %>%
tune_grid(
resamples = folds,
grid = rand_for_grid_optim_2,
control = control_grid(allow_par = TRUE, verbose = TRUE)
)
stopCluster(cl)
res_rand_for_optim_2 <- read_rds("resultados/res_rand_for_optim_2.rds")
collect_metrics(res_rand_for_optim_2) %>% arrange(desc(mean)) %>%
filter(
.metric == "roc_auc"
) %>%
head(
n = 10
) %>%
gt() %>%
fmt_percent(
columns = vars(mean, std_err)
)
| mtry | trees | min_n | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|---|
| 10 | 500 | 1 | roc_auc | binary | 88.31% | 10 | 0.42% | Preprocessor1_Model10 |
| 22 | 500 | 1 | roc_auc | binary | 88.30% | 10 | 0.42% | Preprocessor1_Model11 |
| 10 | 500 | 2 | roc_auc | binary | 88.26% | 10 | 0.42% | Preprocessor1_Model22 |
| 22 | 350 | 1 | roc_auc | binary | 88.24% | 10 | 0.44% | Preprocessor1_Model02 |
| 10 | 450 | 1 | roc_auc | binary | 88.24% | 10 | 0.43% | Preprocessor1_Model07 |
| 22 | 450 | 1 | roc_auc | binary | 88.24% | 10 | 0.47% | Preprocessor1_Model08 |
| 22 | 400 | 1 | roc_auc | binary | 88.24% | 10 | 0.48% | Preprocessor1_Model05 |
| 10 | 400 | 2 | roc_auc | binary | 88.24% | 10 | 0.43% | Preprocessor1_Model16 |
| 22 | 500 | 3 | roc_auc | binary | 88.22% | 10 | 0.45% | Preprocessor1_Model35 |
| 22 | 500 | 2 | roc_auc | binary | 88.21% | 10 | 0.47% | Preprocessor1_Model23 |
rand_for_grid_optim_3 <- grid_regular(
mtry(range = c(8, 22)),
trees(range = c(400,500)),
min_n(range = c(1,1)),
levels = c(mtry = 5, trees = 5, min_n = 1)
)
rand_for_grid_optim_3
## # A tibble: 25 x 3
## mtry trees min_n
## <int> <int> <int>
## 1 8 400 1
## 2 11 400 1
## 3 15 400 1
## 4 18 400 1
## 5 22 400 1
## 6 8 425 1
## 7 11 425 1
## 8 15 425 1
## 9 18 425 1
## 10 22 425 1
## # ... with 15 more rows
res_rand_for_optim_3 <- read_rds("resultados/res_rand_for_optim_3.rds" )
collect_metrics(res_rand_for_optim_3) %>% arrange(desc(mean)) %>%
filter(
.metric == "roc_auc"
) %>%
head(
n = 10
) %>%
gt() %>%
fmt_percent(
columns = vars(mean, std_err)
)
| mtry | trees | min_n | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|---|
| 22 | 475 | 1 | roc_auc | binary | 88.25% | 10 | 0.42% | Preprocessor1_Model20 |
| 18 | 500 | 1 | roc_auc | binary | 88.25% | 10 | 0.46% | Preprocessor1_Model24 |
| 15 | 450 | 1 | roc_auc | binary | 88.24% | 10 | 0.45% | Preprocessor1_Model13 |
| 15 | 500 | 1 | roc_auc | binary | 88.24% | 10 | 0.45% | Preprocessor1_Model23 |
| 18 | 475 | 1 | roc_auc | binary | 88.23% | 10 | 0.44% | Preprocessor1_Model19 |
| 11 | 450 | 1 | roc_auc | binary | 88.22% | 10 | 0.42% | Preprocessor1_Model12 |
| 15 | 425 | 1 | roc_auc | binary | 88.22% | 10 | 0.44% | Preprocessor1_Model08 |
| 11 | 500 | 1 | roc_auc | binary | 88.22% | 10 | 0.44% | Preprocessor1_Model22 |
| 18 | 425 | 1 | roc_auc | binary | 88.22% | 10 | 0.46% | Preprocessor1_Model09 |
| 18 | 400 | 1 | roc_auc | binary | 88.22% | 10 | 0.45% | Preprocessor1_Model04 |
plot_result_tune_ranger(
list(
res_rand_for,
res_rand_for_optim,
res_rand_for_optim_2,
res_rand_for_optim_3
)
)
Despite the hype surrounding neural networks, which makes them look magical and mysterious, they are nonlinear methods in which linear regressions are applied on top of the outputs of other regressions. Some of these outputs are transformed by nonlinear activation functions, which gives the method its nonlinear character. The coefficients of these regressions, also called weights, are calibrated to minimize the error by a clever algorithm called backpropagation.
There are several neural network architectures.
We will use the simplest and most widely used one: the single hidden layer network.
As the figure on the left shows, this model has an input layer, an intermediate "hidden" layer, and an output layer.
\(p\) is the number of inputs, \(K\) the number of possible output categories (in classification; in regression there is a single output), and \(M\) the number of neurons in the hidden layer.
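In symbols, following the standard presentation of this architecture (as in The Elements of Statistical Learning), the hidden layer builds \(M\) derived features and the output layer operates on them:
\[ z_m = \sigma(\alpha_{0m} + \alpha_m^T x), \quad m = 1, \dots, M \]
\[ f_k(x) = g_k(\beta_{0k} + \beta_k^T z), \quad k = 1, \dots, K \]
where \(\sigma\) is the nonlinear activation function (classically a sigmoid) and \(g_k\) the output transformation (a softmax, in classification).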
The name neuron comes from a simplification of how the cell works. Like the nodes of a neural network, the cell also receives several inputs and has an activation process (it switches on or off) depending on the intensity of the information it receives at its inputs.
However, the human brain is an extremely complex network, with an architecture that was selected over billions of years.
More complicated neural network architectures exist, but the number of weights to calibrate explodes, and they require massive amounts of data. The field that studies these deep networks is called deep learning.
Let's do the same exercise we did with the other models, training the network on just two continuous features and one categorical feature.
set.seed(555)
receita_nnet_demo <- recipe(
i_liked_partner ~
i_found_partner__attractive +
sex +
my_age,
data = dado_treino
) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric())
nnet_mod <-
mlp(
hidden_units = 10,
epochs = 1000
) %>%
set_engine("nnet") %>%
set_mode("classification")
wf_nnet_demo <- workflow() %>%
add_recipe(receita_nnet_demo) %>%
add_model(nnet_mod)
fit_nnet_demo <-
wf_nnet_demo %>%
fit(
data = dado_treino
)
fit_nnet_demo
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: mlp()
##
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
##
## * step_center()
## * step_scale()
##
## -- Model -----------------------------------------------------------------------
## a 3-10-1 network with 51 weights
## inputs: i_found_partner__attractive sexHomem my_age
## output(s): ..y
## options were - entropy fitting
We can see that the model creates a complex classification pattern, one that manages to take the interaction between the inputs into account.
predicoes_nnet <- predict(
object = fit_nnet_demo,
new_data = dados_novos
)
predicoes_nnet_com_dados <- bind_cols(
dados_novos,
predicoes_nnet
)
ggplot(predicoes_nnet_com_dados) +
geom_tile(
aes(
x = i_found_partner__attractive,
y = my_age,
fill = .pred_class,
alpha = 0.1
)
) +
geom_jitter(
data = dado_treino,
width = 0.5,
aes(
x = i_found_partner__attractive,
y = my_age,
color = i_liked_partner
),
size = 0.8,
show.legend = FALSE
) +
facet_wrap(
~sex
) +
guides(
alpha = FALSE
) +
theme_minimal() +
scale_fill_manual(
values = c(Liked = "darkgreen", Not = "darkred")
) +
scale_color_manual(
values = c(Liked = "darkgreen", Not = "darkred")
) +
theme(
legend.position = "top"
) +
labs(
fill = "",
x = "Achei bonito",
y = "Minha idade"
)
hidden_units is the number of neurons in the hidden layer. The larger it is, the more flexible the network.
penalty is a penalty similar to the one in ridge regression; it also limits the number of active weights.
epochs is the number of iterations used to train the network. The more iterations, the more the network adapts to the training set. The grid over these three parameters is sketched below.
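The chunk that generated the grid shown below is omitted in this excerpt; a sketch that reproduces those values, relying on the dials defaults for hidden_units() and penalty():
nnet_grid <- grid_regular(
hidden_units(), # default range: 1 to 10
penalty(), # default range: 10^-10 to 10^0, regular in log10
epochs(range = c(10, 1000)),
levels = 3
)
nnet_grid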
## # A tibble: 27 x 3
## hidden_units penalty epochs
## <int> <dbl> <int>
## 1 1 0.0000000001 10
## 2 5 0.0000000001 10
## 3 10 0.0000000001 10
## 4 1 0.00001 10
## 5 5 0.00001 10
## 6 10 0.00001 10
## 7 1 1 10
## 8 5 1 10
## 9 10 1 10
## 10 1 0.0000000001 505
## # ... with 17 more rows
receita_com_interacao_class_center_scale <- receita_com_interacao_class %>%
step_center(all_numeric()) %>%
step_scale(all_numeric())
tune_spec_nnet <- parsnip::mlp(
hidden_units = tune(),
penalty = tune(),
epochs = tune()
) %>%
set_engine(
engine = "nnet",
MaxNWts = 3000
) %>%
set_mode(
"classification"
)
wf_nnet_tune_sample <- workflow() %>%
add_model(
tune_spec_nnet
) %>%
add_recipe(receita_com_interacao_class_center_scale)
all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
res_nnet <- wf_nnet_tune_sample %>%
tune_grid(
resamples = folds,
grid = nnet_grid,
control = control_grid(allow_par = TRUE, verbose = TRUE)
)
stopCluster(cl)
write_rds(res_nnet, "resultados/res_nnet.rds")
res_nnet <- read_rds("resultados/res_nnet.rds")
collect_metrics(res_nnet) %>% arrange(desc(mean))
## # A tibble: 54 x 9
## hidden_units penalty epochs .metric .estimator mean n std_err .config
## <int> <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 1.00e+ 0 1000 roc_auc binary 0.847 10 0.00281 Preproc~
## 2 1 1.00e+ 0 505 roc_auc binary 0.847 10 0.00251 Preproc~
## 3 10 1.00e+ 0 10 roc_auc binary 0.836 10 0.00390 Preproc~
## 4 10 1.00e+ 0 505 roc_auc binary 0.836 10 0.00354 Preproc~
## 5 10 1.00e-10 10 roc_auc binary 0.834 10 0.00352 Preproc~
## 6 10 1.00e+ 0 1000 roc_auc binary 0.831 10 0.00525 Preproc~
## 7 5 1.00e+ 0 10 roc_auc binary 0.828 10 0.00460 Preproc~
## 8 5 1.00e+ 0 505 roc_auc binary 0.828 10 0.00405 Preproc~
## 9 10 1.00e- 5 10 roc_auc binary 0.825 10 0.00558 Preproc~
## 10 1 1.00e+ 0 10 roc_auc binary 0.823 10 0.00557 Preproc~
## # ... with 44 more rows
plot_result_tune_nnet <- function(results){
results %>%
map_df(
.f = collect_metrics
) %>%
filter(
.metric == "roc_auc"
) %>%
group_by(
penalty, epochs, hidden_units
) %>%
summarise(
mean = mean(mean)
) %>%
ungroup() %>%
mutate(
ranque = rank(mean)
) %>%
ggplot() +
geom_tile(
aes(
x = penalty ,
y = epochs %>% factor(),
fill = ranque
)
) +
scale_x_log10() +
geom_shadowtext(
aes(
x = penalty,
y = epochs %>% factor(),
label = percent(mean, accuracy = .01),
),
color = "white",
bg.colour="black",
size = 3
) +
scale_fill_gradient(low = "white", high = "darkgreen") +
facet_wrap(
~hidden_units,
ncol = 1,
labeller = as_labeller( function(x){str_glue("hidden: {x}")} )
) +
theme_minimal() +
theme(
legend.position = "top"
)
}
plot_result_tune_nnet(list(res_nnet))
nnet_grid_2 <- grid_regular(
hidden_units(range = c(1,10)),
penalty( range = c(-1,1)),
epochs(range = c(10, 1000)),
levels = c(hidden_units = 3, penalty = 2, epochs = 3)
)
nnet_grid_2
## # A tibble: 18 x 3
## hidden_units penalty epochs
## <int> <dbl> <int>
## 1 1 0.1 10
## 2 5 0.1 10
## 3 10 0.1 10
## 4 1 10 10
## 5 5 10 10
## 6 10 10 10
## 7 1 0.1 505
## 8 5 0.1 505
## 9 10 0.1 505
## 10 1 10 505
## 11 5 10 505
## 12 10 10 505
## 13 1 0.1 1000
## 14 5 0.1 1000
## 15 10 0.1 1000
## 16 1 10 1000
## 17 5 10 1000
## 18 10 10 1000
all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
res_nnet_2 <- wf_nnet_tune_sample %>%
tune_grid(
resamples = folds,
grid = nnet_grid_2,
control = control_grid(allow_par = TRUE, verbose = TRUE)
)
stopCluster(cl)
write_rds(res_nnet_2, "resultados/res_nnet_2.rds")
res_nnet_2 <- read_rds("resultados/res_nnet_2.rds")
collect_metrics(res_nnet_2) %>% arrange(desc(mean))
## # A tibble: 36 x 9
## hidden_units penalty epochs .metric .estimator mean n std_err .config
## <int> <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 10 10 1000 roc_auc binary 0.859 10 0.00308 Preproces~
## 2 10 10 505 roc_auc binary 0.859 10 0.00317 Preproces~
## 3 5 10 1000 roc_auc binary 0.858 10 0.00302 Preproces~
## 4 5 10 505 roc_auc binary 0.858 10 0.00342 Preproces~
## 5 1 10 505 roc_auc binary 0.849 10 0.00320 Preproces~
## 6 1 10 1000 roc_auc binary 0.848 10 0.00314 Preproces~
## 7 10 10 10 roc_auc binary 0.842 10 0.00455 Preproces~
## 8 5 10 10 roc_auc binary 0.838 10 0.00274 Preproces~
## 9 10 0.1 10 roc_auc binary 0.837 10 0.00372 Preproces~
## 10 1 0.1 505 roc_auc binary 0.832 10 0.00159 Preproces~
## # ... with 26 more rows
nnet_grid_3 <- grid_regular(
hidden_units(range = c(1,15)),
penalty( range = c(0,1)),
epochs(range = c(1500, 2000)),
levels = c(hidden_units = 4, penalty = 2, epochs = 2)
)
all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
res_nnet_3 <- wf_nnet_tune_sample %>%
tune_grid(
resamples = folds,
grid = nnet_grid_3,
control = control_grid(allow_par = TRUE, verbose = TRUE)
)
stopCluster(cl)
write_rds(res_nnet_3, "resultados/res_nnet_3.rds")
res_nnet_3 <- read_rds("resultados/res_nnet_3.rds")
collect_metrics(res_nnet_3) %>% arrange(desc(mean))
## # A tibble: 32 x 9
## hidden_units penalty epochs .metric .estimator mean n std_err .config
## <int> <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 5 10 2000 roc_auc binary 0.859 10 0.00288 Preproces~
## 2 10 10 2000 roc_auc binary 0.859 10 0.00300 Preproces~
## 3 10 10 1500 roc_auc binary 0.859 10 0.00339 Preproces~
## 4 15 10 2000 roc_auc binary 0.859 10 0.00308 Preproces~
## 5 15 10 1500 roc_auc binary 0.858 10 0.00318 Preproces~
## 6 5 10 1500 roc_auc binary 0.858 10 0.00347 Preproces~
## 7 1 10 1500 roc_auc binary 0.849 10 0.00314 Preproces~
## 8 1 10 2000 roc_auc binary 0.848 10 0.00318 Preproces~
## 9 15 1 2000 roc_auc binary 0.847 10 0.00304 Preproces~
## 10 1 1 1500 roc_auc binary 0.846 10 0.00284 Preproces~
## # ... with 22 more rows
nnet_grid_4 <- grid_regular(
hidden_units(range = c(10,15)),
penalty( range = c(1,2)),
epochs(range = c(2500, 3500)),
levels = c(hidden_units = 2, penalty = 2, epochs = 3)
)
all_cores <- parallel::detectCores(logical = FALSE)
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
res_nnet_4 <- wf_nnet_tune_sample %>%
tune_grid(
resamples = folds,
grid = nnet_grid_4,
control = control_grid(allow_par = TRUE, verbose = TRUE)
)
stopCluster(cl)
write_rds(res_nnet_4, "resultados/res_nnet_4.rds")
res_nnet_4 <- read_rds("resultados/res_nnet_4.rds")
collect_metrics(res_nnet_4) %>% arrange(desc(mean))
## # A tibble: 24 x 9
## hidden_units penalty epochs .metric .estimator mean n std_err .config
## <int> <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 10 10 2500 roc_auc binary 0.859 10 0.00328 Preproces~
## 2 15 10 2500 roc_auc binary 0.859 10 0.00323 Preproces~
## 3 10 10 3000 roc_auc binary 0.858 10 0.00319 Preproces~
## 4 15 10 3500 roc_auc binary 0.858 10 0.00306 Preproces~
## 5 15 10 3000 roc_auc binary 0.858 10 0.00314 Preproces~
## 6 10 10 3500 roc_auc binary 0.858 10 0.00308 Preproces~
## 7 10 100 3000 roc_auc binary 0.834 10 0.00428 Preproces~
## 8 10 100 2500 roc_auc binary 0.834 10 0.00428 Preproces~
## 9 10 100 3500 roc_auc binary 0.834 10 0.00427 Preproces~
## 10 15 100 2500 roc_auc binary 0.834 10 0.00426 Preproces~
## # ... with 14 more rows
We have now tried several models and optimized their hyperparameters, always testing them on the validation set, outside the training set.
Our best model was one of the random forest configurations.
We select that model with select_best().
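A minimal sketch of that call, assuming the random forest tuning results live in an object named res_ranger (hypothetical name):
melhor_ranger <- res_ranger %>%
  select_best(metric = "roc_auc")
melhor_ranger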
## # A tibble: 1 x 4
## mtry trees min_n .config
## <int> <int> <int> <chr>
## 1 10 500 1 Preprocessor1_Model10
Then we create a workflow based on the one we were using, adding the information that this is the chosen model.
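A sketch of that step with finalize_workflow(), assuming the tuning workflow was named wf_ranger (hypothetical; final_wf_ranger is the name used further below):
final_wf_ranger <- wf_ranger %>%
  finalize_workflow(melhor_ranger)
final_wf_ranger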
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: rand_forest()
##
## -- Preprocessor ----------------------------------------------------------------
## 22 Recipe Steps
##
## * step_ordinalscore()
## * step_dummy()
## * step_mutate()
## * step_poly()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * ...
## * and 12 more steps.
##
## -- Model -----------------------------------------------------------------------
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = 10
## trees = 500
## min_n = 1
##
## Computational engine: ranger
We train the model on all of the training/validation data.
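A sketch of that fit, using the training portion of the split_dado object that appears below (final_ranger_fit is a hypothetical name):
final_ranger_fit <- final_wf_ranger %>%
  fit(data = training(split_dado))
final_ranger_fit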
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: rand_forest()
##
## -- Preprocessor ----------------------------------------------------------------
## 22 Recipe Steps
##
## * step_ordinalscore()
## * step_dummy()
## * step_mutate()
## * step_poly()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * step_interact()
## * ...
## * and 12 more steps.
##
## -- Model -----------------------------------------------------------------------
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~10L, x), num.trees = ~500L, min.node.size = min_rows(~1L, x), num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 500
## Sample size: 3664
## Number of independent variables: 164
## Mtry: 10
## Target node size: 1
## Variable importance mode: none
## Splitrule: gini
## OOB prediction error (Brier s.): 0.1446971
Let's evaluate the model outside the sample used for validation.
all_cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
final_fit_ranger <-
final_wf_ranger %>%
last_fit(split_dado)
stopCluster(cl)
write_rds(final_fit_ranger, "resultados/final_fit_ranger.rds")
We can see that the model generalized to the test data. Good news.
final_fit_ranger <- read_rds("resultados/final_fit_ranger.rds")
final_fit_ranger %>% collect_metrics() %>%
gt() %>%
fmt_percent(
columns = vars(.estimate)
)
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| accuracy | binary | 80.84% | Preprocessor1_Model1 |
| roc_auc | binary | 89.26% | Preprocessor1_Model1 |
curva <- final_fit_ranger %>%
collect_predictions() %>%
roc_curve(
truth = i_liked_partner,
.pred_Liked
)
ponto_gatilho <- curva %>%
filter(
.threshold > 0.45
) %>%
slice_min(
n = 1, order_by = .threshold
)
curva %>% ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_path() +
geom_point(
data = ponto_gatilho,
aes(x = 1 - specificity, y = sensitivity),
size = 3,
color = "darkblue"
) +
geom_text_repel(
data = ponto_gatilho,
aes(
x = 1 - specificity + 0.15,
y = sensitivity - 0.15,
label = str_glue("{sensitivity %>% percent(accuracy = 0.1)}/{(1 - specificity) %>% percent(accuracy = 0.1)}")
)
) +
geom_abline(lty = 3) +
coord_equal() +
theme_bw()
Time series have properties that pose challenges to the framework we saw earlier. These properties led to the creation of dedicated models: autoregressive models of the ARIMA type.
The main one is that time series observations are often not independent; that is, \(x_t\) depends on \(x_{t-1}, x_{t-2}, ...\). This is true for most of the series we care about.
It is also very common to deal with non-stationary series.
There are several kinds of stationarity but, roughly speaking, a stationary series is one that keeps its distribution over time; that is, the distribution of \(x_t\), and therefore its mean and variance, does not depend on \(t\).
We deal with many series where, at each \(t\), incoming information is incorporated as changes (relative, not absolute) to the level the series had at \(t-1\), as is the case for (non-derivative) asset prices. These series fluctuate around different levels over time.
Tidyverts is a set of tidy-friendly libraries for working with time series.
More information is available at tidyverts.org.
Besides the documentation on that site, the author provides a book on time series. It is not as deep as the classic books on the subject, such as those by Box and by Hamilton, but it presents the concepts well, is available online, and uses tidyverts in its examples.
The code below creates a tsibble, the structure tidyverts uses to represent a time series.
Worth noting:
the use of yearmonth() to create a column representing a calendar month;
when creating the tsibble, the index parameter, which receives the date that serves as the series' time reference.
crime_rio <- read_csv2(
"dados/crime_rio/BaseDPEvolucaoMensalCisp.csv",
locale = locale(encoding = "latin1")
)
crime_estado <- crime_rio %>%
select(
mes_ano,
hom_doloso, lesao_corp_morte, latrocinio, hom_por_interv_policial, tentat_hom, lesao_corp_dolosa, estupro, hom_culposo, lesao_corp_culposa, roubo_comercio, roubo_residencia, roubo_veiculo, roubo_carga, roubo_transeunte, roubo_em_coletivo, roubo_banco, roubo_cx_eletronico, roubo_celular, roubo_conducao_saque, roubo_apos_saque, roubo_bicicleta, outros_roubos, total_roubos, furto_veiculos, furto_transeunte, furto_coletivo, furto_celular, furto_bicicleta, outros_furtos, total_furtos, sequestro, extorsao, sequestro_relampago, estelionato, apreensao_drogas, posse_drogas, trafico_drogas, apreensao_drogas_sem_autor, recuperacao_veiculos, apf, aaapai, cmp, cmba, ameaca, pessoas_desaparecidas, encontro_cadaver, encontro_ossada, pol_militares_mortos_serv, pol_civis_mortos_serv, indicador_letalidade, indicador_roubo_rua, indicador_roubo_veic, registro_ocorrencias, fase
) %>%
group_by(
mes_ano
) %>%
summarise(
across(
.cols = everything(),
.fns = ~sum(.x, na.rm = TRUE)
)
) %>%
ungroup() %>%
separate(
col = mes_ano,
sep = "m",
into = c("ano", "mes")
) %>%
mutate(
data = make_date(ano, mes, 1) %>% yearmonth()
) %>%
arrange(data) %>%
mutate(
across(
.cols = -c(ano, mes, data) ,
.fns = list(diff_log = function(x){c(NA,x %>% log() %>% diff())} )
)
) %>%
mutate(
across(
.cols = ends_with("diff_log"),
.fns = ~if_else(is.na(.x), 0, .x)
)
) %>%
relocate(
data,
hom_doloso,
hom_doloso_diff_log
)
crime_estado_ts <- as_tsibble(crime_estado, index = data)
The tsibble is created and understands that there is one observation per month.
## # A tsibble: 196 x 111 [1M]
## data hom_doloso hom_doloso_diff~ ano mes lesao_corp_morte latrocinio
## <mth> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 2003 jan 589 0 2003 1 6 20
## 2 2003 fev 583 -0.0102 2003 2 4 12
## 3 2003 mar 613 0.0502 2003 3 5 16
## 4 2003 abr 585 -0.0468 2003 4 11 14
## 5 2003 mai 599 0.0236 2003 5 6 20
## 6 2003 jun 546 -0.0926 2003 6 2 14
## 7 2003 jul 527 -0.0354 2003 7 8 17
## 8 2003 ago 500 -0.0526 2003 8 4 10
## 9 2003 set 512 0.0237 2003 9 4 16
## 10 2003 out 489 -0.0460 2003 10 1 16
## # ... with 186 more rows, and 104 more variables:
## # hom_por_interv_policial <dbl>, tentat_hom <dbl>, lesao_corp_dolosa <dbl>,
## # estupro <dbl>, hom_culposo <dbl>, lesao_corp_culposa <dbl>,
## # roubo_comercio <dbl>, roubo_residencia <dbl>, roubo_veiculo <dbl>,
## # roubo_carga <dbl>, roubo_transeunte <dbl>, roubo_em_coletivo <dbl>,
## # roubo_banco <dbl>, roubo_cx_eletronico <dbl>, roubo_celular <dbl>,
## # roubo_conducao_saque <dbl>, roubo_apos_saque <dbl>, roubo_bicicleta <int>,
## # outros_roubos <dbl>, total_roubos <dbl>, furto_veiculos <dbl>,
## # furto_transeunte <dbl>, furto_coletivo <dbl>, furto_celular <dbl>,
## # furto_bicicleta <int>, outros_furtos <dbl>, total_furtos <dbl>,
## # sequestro <dbl>, extorsao <dbl>, sequestro_relampago <dbl>,
## # estelionato <dbl>, apreensao_drogas <dbl>, posse_drogas <int>,
## # trafico_drogas <int>, apreensao_drogas_sem_autor <int>,
## # recuperacao_veiculos <dbl>, apf <int>, aaapai <int>, cmp <int>, cmba <int>,
## # ameaca <dbl>, pessoas_desaparecidas <dbl>, encontro_cadaver <dbl>,
## # encontro_ossada <dbl>, pol_militares_mortos_serv <dbl>,
## # pol_civis_mortos_serv <dbl>, indicador_letalidade <dbl>,
## # indicador_roubo_rua <dbl>, indicador_roubo_veic <dbl>,
## # registro_ocorrencias <dbl>, fase <dbl>, lesao_corp_morte_diff_log <dbl>,
## # latrocinio_diff_log <dbl>, hom_por_interv_policial_diff_log <dbl>,
## # tentat_hom_diff_log <dbl>, lesao_corp_dolosa_diff_log <dbl>,
## # estupro_diff_log <dbl>, hom_culposo_diff_log <dbl>,
## # lesao_corp_culposa_diff_log <dbl>, roubo_comercio_diff_log <dbl>,
## # roubo_residencia_diff_log <dbl>, roubo_veiculo_diff_log <dbl>,
## # roubo_carga_diff_log <dbl>, roubo_transeunte_diff_log <dbl>,
## # roubo_em_coletivo_diff_log <dbl>, roubo_banco_diff_log <dbl>,
## # roubo_cx_eletronico_diff_log <dbl>, roubo_celular_diff_log <dbl>,
## # roubo_conducao_saque_diff_log <dbl>, roubo_apos_saque_diff_log <dbl>,
## # roubo_bicicleta_diff_log <dbl>, outros_roubos_diff_log <dbl>,
## # total_roubos_diff_log <dbl>, furto_veiculos_diff_log <dbl>,
## # furto_transeunte_diff_log <dbl>, furto_coletivo_diff_log <dbl>,
## # furto_celular_diff_log <dbl>, furto_bicicleta_diff_log <dbl>,
## # outros_furtos_diff_log <dbl>, total_furtos_diff_log <dbl>,
## # sequestro_diff_log <dbl>, extorsao_diff_log <dbl>,
## # sequestro_relampago_diff_log <dbl>, estelionato_diff_log <dbl>,
## # apreensao_drogas_diff_log <dbl>, posse_drogas_diff_log <dbl>,
## # trafico_drogas_diff_log <dbl>, apreensao_drogas_sem_autor_diff_log <dbl>,
## # recuperacao_veiculos_diff_log <dbl>, apf_diff_log <dbl>,
## # aaapai_diff_log <dbl>, cmp_diff_log <dbl>, cmba_diff_log <dbl>,
## # ameaca_diff_log <dbl>, pessoas_desaparecidas_diff_log <dbl>,
## # encontro_cadaver_diff_log <dbl>, encontro_ossada_diff_log <dbl>,
## # pol_militares_mortos_serv_diff_log <dbl>,
## # pol_civis_mortos_serv_diff_log <dbl>, indicador_letalidade_diff_log <dbl>,
## # ...
In the previous series, we summed the whole state's crimes for each month. In the code below, we create a tsibble containing the series for each police precinct.
This is done using the key parameter.
crime_delegacia <- crime_rio %>%
select(
CISP, AISP, RISP, munic, Regiao,
mes_ano,
hom_doloso, lesao_corp_morte, latrocinio, hom_por_interv_policial, tentat_hom, lesao_corp_dolosa, estupro, hom_culposo, lesao_corp_culposa, roubo_comercio, roubo_residencia, roubo_veiculo, roubo_carga, roubo_transeunte, roubo_em_coletivo, roubo_banco, roubo_cx_eletronico, roubo_celular, roubo_conducao_saque, roubo_apos_saque, roubo_bicicleta, outros_roubos, total_roubos, furto_veiculos, furto_transeunte, furto_coletivo, furto_celular, furto_bicicleta, outros_furtos, total_furtos, sequestro, extorsao, sequestro_relampago, estelionato, apreensao_drogas, posse_drogas, trafico_drogas, apreensao_drogas_sem_autor, recuperacao_veiculos, apf, aaapai, cmp, cmba, ameaca, pessoas_desaparecidas, encontro_cadaver, encontro_ossada, pol_militares_mortos_serv, pol_civis_mortos_serv, indicador_letalidade, indicador_roubo_rua, indicador_roubo_veic, registro_ocorrencias, fase
) %>%
separate(
col = mes_ano,
sep = "m",
into = c("ano", "mes")
) %>%
mutate(
data = make_date(ano, mes, 1) %>% yearmonth()
)
crime_delegacias_ts <- as_tsibble(
x = crime_delegacia,
index = data,
key = c(CISP, AISP, RISP, munic, Regiao)
)
crime_delegacias_ts
## # A tsibble: 26,217 x 62 [1M]
## # Key: CISP, AISP, RISP, munic, Regiao [188]
## CISP AISP RISP munic Regiao ano mes hom_doloso lesao_corp_morte
## <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 1 5 1 Rio ~ Capit~ 2003 1 0 0
## 2 1 5 1 Rio ~ Capit~ 2003 2 0 0
## 3 1 5 1 Rio ~ Capit~ 2003 3 0 0
## 4 1 5 1 Rio ~ Capit~ 2003 4 0 0
## 5 1 5 1 Rio ~ Capit~ 2003 5 0 0
## 6 1 5 1 Rio ~ Capit~ 2003 6 0 0
## 7 1 5 1 Rio ~ Capit~ 2003 7 0 0
## 8 1 5 1 Rio ~ Capit~ 2003 8 0 0
## 9 1 5 1 Rio ~ Capit~ 2003 9 1 0
## 10 1 5 1 Rio ~ Capit~ 2003 10 0 0
## # ... with 26,207 more rows, and 53 more variables: latrocinio <dbl>,
## # hom_por_interv_policial <dbl>, tentat_hom <dbl>, lesao_corp_dolosa <dbl>,
## # estupro <dbl>, hom_culposo <dbl>, lesao_corp_culposa <dbl>,
## # roubo_comercio <dbl>, roubo_residencia <dbl>, roubo_veiculo <dbl>,
## # roubo_carga <dbl>, roubo_transeunte <dbl>, roubo_em_coletivo <dbl>,
## # roubo_banco <dbl>, roubo_cx_eletronico <dbl>, roubo_celular <dbl>,
## # roubo_conducao_saque <dbl>, roubo_apos_saque <dbl>, roubo_bicicleta <lgl>,
## # outros_roubos <dbl>, total_roubos <dbl>, furto_veiculos <dbl>,
## # furto_transeunte <dbl>, furto_coletivo <dbl>, furto_celular <dbl>,
## # furto_bicicleta <lgl>, outros_furtos <dbl>, total_furtos <dbl>,
## # sequestro <dbl>, extorsao <dbl>, sequestro_relampago <dbl>,
## # estelionato <dbl>, apreensao_drogas <dbl>, posse_drogas <lgl>,
## # trafico_drogas <lgl>, apreensao_drogas_sem_autor <lgl>,
## # recuperacao_veiculos <dbl>, apf <lgl>, aaapai <lgl>, cmp <lgl>, cmba <lgl>,
## # ameaca <dbl>, pessoas_desaparecidas <dbl>, encontro_cadaver <dbl>,
## # encontro_ossada <dbl>, pol_militares_mortos_serv <dbl>,
## # pol_civis_mortos_serv <dbl>, indicador_letalidade <dbl>,
## # indicador_roubo_rua <dbl>, indicador_roubo_veic <dbl>,
## # registro_ocorrencias <dbl>, fase <dbl>, data <mth>
feasts is a library that helps extract features from a time series.
Below we see the three plots it generates with the gg_tsdisplay() function:
the plot of the series itself;
the ACF plot, i.e., the autocorrelation function. ACF(lag) is the correlation between \(x_t\) and \(x_{t-lag}\). We see a strong correlation between \(x_t\) and \(x_{t-1}\), which, depending on its level, can be characteristic of non-stationary processes. Looking at the series, we can see it wanders around a level that changes over time, which is also characteristic of non-stationary series. Next we will see a test that tells us whether there is evidence that the series is non-stationary;
a seasonality plot. There appears to be seasonality in the series: it reaches higher levels at the beginning of the year and lower levels in the middle.
conjunto <- gg_tsdisplay(
crime_estado_ts,
y = hom_doloso
)
embeleza_gg_tsdisplay <- function(obj){
obj[[1]] <- obj[[1]] +
theme_minimal() +
scale_x_yearmonth(
date_breaks = "1 year"
) +
theme(
axis.text.x = element_text(angle = 90)
)
obj[[2]] <- obj[[2]] +
theme_minimal() +
scale_y_continuous(
limits = c(-1,1)
)
obj[[3]] <- obj[[3]] +
theme_minimal() +
scale_color_gradient(
low = "lightblue",
high = "darkblue",
labels = function(x){x+2003},
breaks = seq(0, 20, by = 2)
) +
theme(
axis.text.x = element_text(angle = 90)
)
obj
}
embeleza_gg_tsdisplay(conjunto)
There are tests for the hypothesis that a series is stationary.
In the KPSS test (Kwiatkowski-Phillips-Schmidt-Shin) run below, the null hypothesis is that the series is stationary.
The p-value shows that the chance of obtaining a result as extreme as the one observed, were the null hypothesis true, is very low, so we can reject the null hypothesis: there is evidence that the series is non-stationary. (The p-values of this test are interpolated from a table and truncated to the range 0.01-0.1.)
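The output below was presumably produced with the unitroot_kpss feature from feasts:
crime_estado_ts %>%
  features(hom_doloso, unitroot_kpss)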
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 2.25 0.01
A common transformation that often derives a stationary series from a non-stationary one is taking the difference of the log of the series. That is, the series \(y\) obtained from the series \(x\) is \(y_t = log(x_t) - log(x_{t-1})\).
The transformation makes sense because:
often the information arriving at \(t\) shows up as a change from the level \(x_{t-1}\) to \(x_t\). Think of the return on an asset price. Hence the difference;
this change in level is often relative to the level, not an absolute amount. Again, think of the return on an asset price, which we measure in percentages. Hence the \(log\).
An interesting property of the natural log is that \(log(a) - log(b)\) approximates \(\frac{a}{b}-1\), which is the percentage return. The approximation works better the smaller this number is.
This property makes some results easier to interpret.
crime_estado_ts %>%
mutate(
diff_perc = hom_doloso/lag(hom_doloso) - 1
) %>%
select(
data,
hom_doloso,
hom_doloso_diff_log,
diff_perc
) %>%
gt() %>%
fmt_percent(
columns = vars(hom_doloso_diff_log, diff_perc)
)
| data | hom_doloso | hom_doloso_diff_log | diff_perc |
|---|---|---|---|
| 2003 jan | 589 | 0.00% | NA |
| 2003 fev | 583 | −1.02% | −1.02% |
| 2003 mar | 613 | 5.02% | 5.15% |
| 2003 abr | 585 | −4.68% | −4.57% |
| 2003 mai | 599 | 2.36% | 2.39% |
| 2003 jun | 546 | −9.26% | −8.85% |
| 2003 jul | 527 | −3.54% | −3.48% |
| 2003 ago | 500 | −5.26% | −5.12% |
| 2003 set | 512 | 2.37% | 2.40% |
| 2003 out | 489 | −4.60% | −4.49% |
| 2003 nov | 528 | 7.67% | 7.98% |
| 2003 dez | 553 | 4.63% | 4.73% |
| 2004 jan | 578 | 4.42% | 4.52% |
| 2004 fev | 540 | −6.80% | −6.57% |
| 2004 mar | 529 | −2.06% | −2.04% |
| 2004 abr | 514 | −2.88% | −2.84% |
| 2004 mai | 605 | 16.30% | 17.70% |
| 2004 jun | 502 | −18.66% | −17.02% |
| 2004 jul | 505 | 0.60% | 0.60% |
| 2004 ago | 521 | 3.12% | 3.17% |
| 2004 set | 507 | −2.72% | −2.69% |
| 2004 out | 522 | 2.92% | 2.96% |
| 2004 nov | 570 | 8.80% | 9.20% |
| 2004 dez | 545 | −4.49% | −4.39% |
| 2005 jan | 607 | 10.77% | 11.38% |
| 2005 fev | 619 | 1.96% | 1.98% |
| 2005 mar | 682 | 9.69% | 10.18% |
| 2005 abr | 526 | −25.97% | −22.87% |
| 2005 mai | 561 | 6.44% | 6.65% |
| 2005 jun | 488 | −13.94% | −13.01% |
| 2005 jul | 580 | 17.27% | 18.85% |
| 2005 ago | 562 | −3.15% | −3.10% |
| 2005 set | 533 | −5.30% | −5.16% |
| 2005 out | 503 | −5.79% | −5.63% |
| 2005 nov | 495 | −1.60% | −1.59% |
| 2005 dez | 464 | −6.47% | −6.26% |
| 2006 jan | 480 | 3.39% | 3.45% |
| 2006 fev | 521 | 8.20% | 8.54% |
| 2006 mar | 607 | 15.28% | 16.51% |
| 2006 abr | 579 | −4.72% | −4.61% |
| 2006 mai | 548 | −5.50% | −5.35% |
| 2006 jun | 475 | −14.30% | −13.32% |
| 2006 jul | 478 | 0.63% | 0.63% |
| 2006 ago | 471 | −1.48% | −1.46% |
| 2006 set | 521 | 10.09% | 10.62% |
| 2006 out | 552 | 5.78% | 5.95% |
| 2006 nov | 527 | −4.63% | −4.53% |
| 2006 dez | 564 | 6.79% | 7.02% |
| 2007 jan | 526 | −6.98% | −6.74% |
| 2007 fev | 486 | −7.91% | −7.60% |
| 2007 mar | 640 | 27.53% | 31.69% |
| 2007 abr | 572 | −11.23% | −10.62% |
| 2007 mai | 466 | −20.50% | −18.53% |
| 2007 jun | 445 | −4.61% | −4.51% |
| 2007 jul | 457 | 2.66% | 2.70% |
| 2007 ago | 524 | 13.68% | 14.66% |
| 2007 set | 447 | −15.89% | −14.69% |
| 2007 out | 486 | 8.37% | 8.72% |
| 2007 nov | 528 | 8.29% | 8.64% |
| 2007 dez | 556 | 5.17% | 5.30% |
| 2008 jan | 538 | −3.29% | −3.24% |
| 2008 fev | 505 | −6.33% | −6.13% |
| 2008 mar | 527 | 4.26% | 4.36% |
| 2008 abr | 475 | −10.39% | −9.87% |
| 2008 mai | 412 | −14.23% | −13.26% |
| 2008 jun | 402 | −2.46% | −2.43% |
| 2008 jul | 413 | 2.70% | 2.74% |
| 2008 ago | 430 | 4.03% | 4.12% |
| 2008 set | 435 | 1.16% | 1.16% |
| 2008 out | 557 | 24.72% | 28.05% |
| 2008 nov | 516 | −7.65% | −7.36% |
| 2008 dez | 507 | −1.76% | −1.74% |
| 2009 jan | 551 | 8.32% | 8.68% |
| 2009 fev | 556 | 0.90% | 0.91% |
| 2009 mar | 588 | 5.60% | 5.76% |
| 2009 abr | 542 | −8.15% | −7.82% |
| 2009 mai | 522 | −3.76% | −3.69% |
| 2009 jun | 439 | −17.32% | −15.90% |
| 2009 jul | 397 | −10.06% | −9.57% |
| 2009 ago | 432 | 8.45% | 8.82% |
| 2009 set | 433 | 0.23% | 0.23% |
| 2009 out | 419 | −3.29% | −3.23% |
| 2009 nov | 438 | 4.43% | 4.53% |
| 2009 dez | 476 | 8.32% | 8.68% |
| 2010 jan | 447 | −6.29% | −6.09% |
| 2010 fev | 473 | 5.65% | 5.82% |
| 2010 mar | 492 | 3.94% | 4.02% |
| 2010 abr | 432 | −13.01% | −12.20% |
| 2010 mai | 361 | −17.95% | −16.44% |
| 2010 jun | 347 | −3.96% | −3.88% |
| 2010 jul | 324 | −6.86% | −6.63% |
| 2010 ago | 344 | 5.99% | 6.17% |
| 2010 set | 360 | 4.55% | 4.65% |
| 2010 out | 406 | 12.02% | 12.78% |
| 2010 nov | 364 | −10.92% | −10.34% |
| 2010 dez | 417 | 13.59% | 14.56% |
| 2011 jan | 425 | 1.90% | 1.92% |
| 2011 fev | 368 | −14.40% | −13.41% |
| 2011 mar | 381 | 3.47% | 3.53% |
| 2011 abr | 403 | 5.61% | 5.77% |
| 2011 mai | 368 | −9.09% | −8.68% |
| 2011 jun | 307 | −18.12% | −16.58% |
| 2011 jul | 331 | 7.53% | 7.82% |
| 2011 ago | 371 | 11.41% | 12.08% |
| 2011 set | 323 | −13.85% | −12.94% |
| 2011 out | 318 | −1.56% | −1.55% |
| 2011 nov | 339 | 6.39% | 6.60% |
| 2011 dez | 345 | 1.75% | 1.77% |
| 2012 jan | 329 | −4.75% | −4.64% |
| 2012 fev | 395 | 18.28% | 20.06% |
| 2012 mar | 394 | −0.25% | −0.25% |
| 2012 abr | 342 | −14.15% | −13.20% |
| 2012 mai | 346 | 1.16% | 1.17% |
| 2012 jun | 318 | −8.44% | −8.09% |
| 2012 jul | 298 | −6.50% | −6.29% |
| 2012 ago | 294 | −1.35% | −1.34% |
| 2012 set | 331 | 11.85% | 12.59% |
| 2012 out | 314 | −5.27% | −5.14% |
| 2012 nov | 325 | 3.44% | 3.50% |
| 2012 dez | 395 | 19.51% | 21.54% |
| 2013 jan | 397 | 0.51% | 0.51% |
| 2013 fev | 389 | −2.04% | −2.02% |
| 2013 mar | 411 | 5.50% | 5.66% |
| 2013 abr | 417 | 1.45% | 1.46% |
| 2013 mai | 430 | 3.07% | 3.12% |
| 2013 jun | 362 | −17.21% | −15.81% |
| 2013 jul | 302 | −18.12% | −16.57% |
| 2013 ago | 407 | 29.84% | 34.77% |
| 2013 set | 378 | −7.39% | −7.13% |
| 2013 out | 377 | −0.26% | −0.26% |
| 2013 nov | 414 | 9.36% | 9.81% |
| 2013 dez | 461 | 10.75% | 11.35% |
| 2014 jan | 464 | 0.65% | 0.65% |
| 2014 fev | 482 | 3.81% | 3.88% |
| 2014 mar | 510 | 5.65% | 5.81% |
| 2014 abr | 449 | −12.74% | −11.96% |
| 2014 mai | 444 | −1.12% | −1.11% |
| 2014 jun | 377 | −16.36% | −15.09% |
| 2014 jul | 370 | −1.87% | −1.86% |
| 2014 ago | 373 | 0.81% | 0.81% |
| 2014 set | 345 | −7.80% | −7.51% |
| 2014 out | 375 | 8.34% | 8.70% |
| 2014 nov | 345 | −8.34% | −8.00% |
| 2014 dez | 408 | 16.77% | 18.26% |
| 2015 jan | 439 | 7.32% | 7.60% |
| 2015 fev | 326 | −29.76% | −25.74% |
| 2015 mar | 382 | 15.85% | 17.18% |
| 2015 abr | 339 | −11.94% | −11.26% |
| 2015 mai | 347 | 2.33% | 2.36% |
| 2015 jun | 272 | −24.35% | −21.61% |
| 2015 jul | 306 | 11.78% | 12.50% |
| 2015 ago | 336 | 9.35% | 9.80% |
| 2015 set | 351 | 4.37% | 4.46% |
| 2015 out | 380 | 7.94% | 8.26% |
| 2015 nov | 340 | −11.12% | −10.53% |
| 2015 dez | 382 | 11.65% | 12.35% |
| 2016 jan | 406 | 6.09% | 6.28% |
| 2016 fev | 404 | −0.49% | −0.49% |
| 2016 mar | 445 | 9.67% | 10.15% |
| 2016 abr | 475 | 6.52% | 6.74% |
| 2016 mai | 369 | −25.25% | −22.32% |
| 2016 jun | 373 | 1.08% | 1.08% |
| 2016 jul | 368 | −1.35% | −1.34% |
| 2016 ago | 387 | 5.03% | 5.16% |
| 2016 set | 425 | 9.37% | 9.82% |
| 2016 out | 465 | 8.99% | 9.41% |
| 2016 nov | 461 | −0.86% | −0.86% |
| 2016 dez | 464 | 0.65% | 0.65% |
| 2017 jan | 479 | 3.18% | 3.23% |
| 2017 fev | 503 | 4.89% | 5.01% |
| 2017 mar | 498 | −1.00% | −0.99% |
| 2017 abr | 436 | −13.30% | −12.45% |
| 2017 mai | 425 | −2.56% | −2.52% |
| 2017 jun | 390 | −8.59% | −8.24% |
| 2017 jul | 374 | −4.19% | −4.10% |
| 2017 ago | 397 | 5.97% | 6.15% |
| 2017 set | 458 | 14.29% | 15.37% |
| 2017 out | 486 | 5.93% | 6.11% |
| 2017 nov | 455 | −6.59% | −6.38% |
| 2017 dez | 445 | −2.22% | −2.20% |
| 2018 jan | 468 | 5.04% | 5.17% |
| 2018 fev | 441 | −5.94% | −5.77% |
| 2018 mar | 508 | 14.14% | 15.19% |
| 2018 abr | 476 | −6.51% | −6.30% |
| 2018 mai | 423 | −11.80% | −11.13% |
| 2018 jun | 375 | −12.04% | −11.35% |
| 2018 jul | 410 | 8.92% | 9.33% |
| 2018 ago | 360 | −13.01% | −12.20% |
| 2018 set | 382 | 5.93% | 6.11% |
| 2018 out | 383 | 0.26% | 0.26% |
| 2018 nov | 378 | −1.31% | −1.31% |
| 2018 dez | 346 | −8.85% | −8.47% |
| 2019 jan | 386 | 10.94% | 11.56% |
| 2019 fev | 319 | −19.06% | −17.36% |
| 2019 mar | 344 | 7.55% | 7.84% |
| 2019 abr | 356 | 3.43% | 3.49% |
The log-difference series visually shows a more constant mean and variance.
The autocorrelations are still present, but with much smaller magnitude. Note the autocorrelation at lag 12, due to seasonality.
serie_diff_log <- gg_tsdisplay(crime_estado_ts, y = hom_doloso_diff_log )
embeleza_gg_tsdisplay(serie_diff_log)
This time the test offers no evidence to reject the null hypothesis that the series is stationary.
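Again via the unitroot_kpss feature, now on the log-difference column:
crime_estado_ts %>%
  features(hom_doloso_diff_log, unitroot_kpss)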
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0212 0.1
feasts offers a way to decompose a time series into trend, seasonality and noise.
dcmp <- crime_estado_ts %>%
model(STL(hom_doloso ~ season(window = Inf)))
components(dcmp) %>% autoplot()
dcmp <- crime_estado_ts %>%
model(STL(hom_doloso_diff_log ~ season(window = Inf)))
components(dcmp) %>% autoplot()
A widely used model for time series forecasting, one that accounts for and takes advantage of autocorrelation, is ARIMA.
ARIMA is a combination of the AR (autoregressive) and MA ("moving average", actually a sum of previous shocks) models. The I (integrated) accounts for the transformation we applied earlier: taking differences between elements of the series.
In the autoregressive (AR) model, the series is represented as a process that depends on previous elements of the series itself, plus a shock at \(t\). An AR(p) is therefore:
\[x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + ... + \phi_p x_{t-p} + a_t \]
In the "moving average" (MA) model, the series is represented as a process that depends on previous shocks, plus a shock at \(t\). An MA(q) is therefore:
\[x_t = \theta_1 a_{t-1} + \theta_2 a_{t-2} + ... + \theta_q a_{t-q} + a_t \]
An AR process can be described by an MA process and vice versa, but that would require an infinite number of coefficients.
For parsimony, since coefficients must be estimated and each estimate adds uncertainty, it is common to combine the two models into an ARMA(p,q):
\[x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + ... + \phi_p x_{t-p} + \theta_1 a_{t-1} + \theta_2 a_{t-2} + ... + \theta_q a_{t-q} + a_t \]
Applying the ARMA to the differenced series turns it into an ARIMA(p,d,q), where d is the number of times the series is differenced.
It is possible to add P + Q autoregressive and moving-average terms at multiples of the lag corresponding to the seasonal period m of the series.
For a monthly series with annual seasonality, AR and MA terms would be added at lags 12, 24, 36, ..., and the differencing could be done at lag 12.
Some call this seasonal model SARIMA(p, d, q)(P, D, Q)[m].
The fable library helps estimate several models used for time series forecasting.
One of the available models is the SARIMA.
The model specification is chosen so as to obtain a parsimonious model. The performance metric used (AIC) penalizes models with more coefficients, that is, with a larger sum of p, q, P and Q.
d and D are chosen as the smallest values at which the series looks stationary.
Note that we use the log, because we believe the series changes are always relative to its previous level, not absolute.
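A sketch of the fit that generates the report below (the object name modelo is an assumption); ARIMA() searches over the orders automatically, and pdq()/PDQ() could be added to the formula to fix them by hand:
modelo <- crime_estado_ts %>%
  model(arima = ARIMA(log(hom_doloso)))
modelo %>% report()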
## Series: hom_doloso
## Model: ARIMA(1,1,2)(0,0,2)[12]
## Transformation: log(hom_doloso)
##
## Coefficients:
## ar1 ma1 ma2 sma1 sma2
## 0.8257 -1.171 0.2042 0.2223 0.2101
## s.e. 0.0888 0.126 0.1070 0.0774 0.0609
##
## sigma^2 estimated as 0.008022: log likelihood=195.44
## AIC=-378.89 AICc=-378.44 BIC=-359.25
The accuracy function shows some performance metrics ON THE TRAINING SET.
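Reusing the modelo object sketched above:
modelo %>% accuracy()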
## # A tibble: 1 x 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Training -2.55 38.9 31.6 -1.22 7.27 0.587 0.587 0.00749
The forecast function from fabletools lets us extract the model's forecasts over a horizon.
Note that the numbers are already transformed back to the series' original unit.
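A sketch matching the 12-month horizon shown below:
previsao <- modelo %>% forecast(h = 12)
previsao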
## # A fable: 12 x 4 [1M]
## # Key: .model [1]
## .model data hom_doloso .mean
## <chr> <mth> <dist> <dbl>
## 1 arima 2019 mai t(N(5.9, 0.008)) 358.
## 2 arima 2019 jun t(N(5.9, 0.011)) 350.
## 3 arima 2019 jul t(N(5.9, 0.014)) 360.
## 4 arima 2019 ago t(N(5.9, 0.016)) 355.
## 5 arima 2019 set t(N(5.9, 0.018)) 368.
## 6 arima 2019 out t(N(5.9, 0.019)) 373.
## 7 arima 2019 nov t(N(5.9, 0.02)) 370.
## 8 arima 2019 dez t(N(5.9, 0.021)) 364.
## 9 arima 2020 jan t(N(5.9, 0.022)) 377.
## 10 arima 2020 fev t(N(5.9, 0.023)) 356.
## 11 arima 2020 mar t(N(5.9, 0.023)) 372.
## 12 arima 2020 abr t(N(5.9, 0.024)) 375.
The autoplot() function (in the feasts/fabletools family) draws the forecast plot including the prediction intervals.
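Reusing the forecast sketched above:
previsao %>% autoplot(crime_estado_ts)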
It is possible to add other regressors besides the ARIMA terms.
Imagine we believe that heat makes people irritable, to the point of increasing the number of homicides.
Let's take temperature data from a station in the city of Rio de Janeiro as a proxy for the state's temperature.
estacao_rio <- worldmet::getMeta(country = "BR")
dados_temp <- worldmet::importNOAA(code = "837460-99999", year = 2003:2020, hourly = TRUE)
write_rds(dados_temp, "dados/crime_rio/temperatura.rds")
We will use the 67th percentile of the hourly temperature within each month.
dados_temp <- read_rds("dados/crime_rio/temperatura.rds")
dados_temperatura_mensais_normais <- dados_temp %>%
mutate(
mes = month(date)
) %>%
group_by(
mes
) %>%
summarise(
p67_temp_normal = quantile(air_temp, probs = .67, na.rm = TRUE) # 67th percentile, matching the variable name
)
dados_temperatura_mensais <- dados_temp %>%
mutate(
ano_mes = yearmonth(date),
mes = month(date)
) %>%
group_by(
ano_mes,
mes
) %>%
summarise(
p67_temp = quantile(air_temp, probs = .67, na.rm = TRUE)
) %>%
left_join(
dados_temperatura_mensais_normais, by = c("mes")
) %>%
mutate(
desvio_temperatura = p67_temp - p67_temp_normal
)
Then we join this to the data we already had.
crime_estado_ts <- crime_estado_ts %>%
left_join(
dados_temperatura_mensais,
by = c("data" = "ano_mes")
)
We can now run the model with this variable.
The estimate is that each additional degree of temperature adds about 0.02 to the log of the differenced series (note that d = 1). Recalling the equivalence between log differences and percentage changes, this means roughly a 2% increase in homicides per additional degree of temperature.
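To check the approximation against the coefficient reported below (0.0218):
exp(0.0218) - 1  # ~0.022, i.e. about a 2.2% increase per extra degree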
modelo_temp <- crime_estado_ts %>%
model(arima = ARIMA(log(hom_doloso) ~ p67_temp ))
modelo_temp %>%
report()
## Series: hom_doloso
## Model: LM w/ ARIMA(0,1,1)(0,0,2)[12] errors
## Transformation: log(hom_doloso)
##
## Coefficients:
## ma1 sma1 sma2 p67_temp
## -0.4882 0.1066 0.1446 0.0218
## s.e. 0.0682 0.0759 0.0622 0.0034
##
## sigma^2 estimated as 0.007119: log likelihood=207.02
## AIC=-404.04 AICc=-403.73 BIC=-387.68
The model also delivers a slightly better performance, again IN SAMPLE.
## # A tibble: 1 x 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Training -1.04 37.5 30.1 -0.816 6.85 0.559 0.566 0.0412
Of course, this model cannot be used for forecasting as-is, since we do not know what the temperature will be in the coming months, but we could draw up scenarios or use another model to forecast that variable.
Here we show two scenarios, with the 67th percentile of the temperature at 26 and at 29 degrees.
previsao_29 <- forecast(
modelo_temp,
new_data =
tsibble(
data =
seq.Date(
from = max(crime_estado_ts$data %>% as_date() %m+% months(1)),
length.out = 12,
by = "1 month") %>% yearmonth(),
p67_temp = replicate(expr = 29, n = 12),
index = data
)
)
previsao_26 <- forecast(
modelo_temp,
new_data =
tsibble(
data =
seq.Date(
from = max(crime_estado_ts$data %>% as_date() %m+% months(1)),
length.out = 12,
by = "1 month") %>% yearmonth(),
p67_temp = replicate(expr = 26, n = 12),
index = data
)
)
ggplot() +
geom_line(
data = crime_estado_ts,
aes(
x = data,
y = hom_doloso
)
) +
geom_point(
data = crime_estado_ts,
aes(
x = data,
y = hom_doloso
)
) +
geom_line(
data = previsao_29,
aes(
x = data,
y = .mean
),
color = "red"
) +
geom_point(
data = previsao_29,
aes(
x = data,
y = .mean
),
color = "red"
) +
geom_line(
data = previsao_26,
aes(
x = data,
y = .mean
),
color = "blue"
) +
geom_point(
data = previsao_26,
aes(
x = data,
y = .mean
),
color = "blue"
)
Just as we did with the models that were not time series models, we can split the data into training and test sets.
But with time series, we need to preserve the chronological order within each set.
We can pick a specific date and use the earlier data for training and the later data for testing.
First, the model without temperature.
modelo_puro <- function(dado){
dado %>%
model(arima = ARIMA(log(hom_doloso) ))
}
crime_estado_ts_treino <- crime_estado_ts %>%
filter(
data < yearmonth("20190101")
)
fit_treino <- crime_estado_ts_treino %>%
modelo_puro
previsao <- fit_treino %>%
forecast(
crime_estado_ts %>% filter(data >= yearmonth("20190101"))
)
accuracy(previsao, crime_estado_ts)
## # A tibble: 1 x 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Test -31.3 43.3 39.5 -9.50 11.6 0.754 0.673 -0.251
Then the model with temperature, which does better.
modelo_com_temp <- function(dado){
dado %>%
model(arima = ARIMA(log(hom_doloso) ~ p67_temp ))
}
crime_estado_ts_treino <- crime_estado_ts %>%
filter(
data < yearmonth("20190101")
)
fit_treino <- crime_estado_ts_treino %>%
modelo_com_temp
previsao <- fit_treino %>%
forecast(
crime_estado_ts %>% filter(data >= yearmonth("20190101"))
)
accuracy(previsao, crime_estado_ts)
## # A tibble: 1 x 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Test -30.7 36.4 30.7 -9.13 9.13 0.586 0.566 -0.292
We can roll this cut-off date between the training and validation data, producing a series of forecasts always at the same horizon.
The scheme works like this: at each round we pick a cut-off date, which keeps rolling forward; the data before it form the training set and the observation immediately after it is the validation point (we could instead always take the observation 6 months past the cut-off, for example, if we wanted to evaluate performance at that horizon).
Below we implement a function to be called for each chosen cut-off date.
previsao_fora_amostra <- function(modelo, data_corte, dado){
dado_treino <- dado %>%
filter(
data < yearmonth(data_corte)
)
fit_treino <- dado_treino %>%
modelo
previsao <- fit_treino %>%
forecast(
dado %>% filter(data >= yearmonth(data_corte)) %>% slice(1) # one-step-ahead: first observation after the cut-off
)
previsao %>%
as_tibble() %>%
select(
data, .mean
)
}
We run the procedure for the simple model, without temperature.
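As a quick sanity check before looping over every cut-off, a single call might look like this sketch (the date is arbitrary):
previsao_fora_amostra(
  modelo = modelo_puro,
  data_corte = make_date(2015, 1, 1),
  dado = crime_estado_ts
)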
cortes <- seq.Date(
from = make_date(2013,1,1),
to = max(crime_estado_ts$data) %>% as.Date(),
by = "1 month"
)
plan(multiprocess)
previsoes_fora_1_puro <- cortes %>%
future_map_dfr(
.f = ~previsao_fora_amostra(modelo = modelo_puro, data_corte = .x, dado = crime_estado_ts),
.progress = TRUE
)
previsoes_fora_1_puro_ts <- previsoes_fora_1_puro %>%
mutate(
hom_doloso = .mean
) %>%
as_tsibble(
index = data
)
dados_reais <- crime_estado_ts %>%
filter(
data %in% (cortes %>% yearmonth())
)
performance_puro <- accuracy(previsoes_fora_1_puro_ts$hom_doloso, dados_reais$hom_doloso)
write_rds(performance_puro, "dados/crime_rio/performance_puro.rds")
And now the same procedure for the model with temperature.
plan(multiprocess)
previsoes_fora_1_com_temp <- cortes %>%
future_map_dfr(
.f = ~previsao_fora_amostra(modelo = modelo_com_temp, data_corte = .x, dado = crime_estado_ts),
.progress = TRUE
)
previsoes_fora_1_com_temp_ts <- previsoes_fora_1_com_temp %>%
mutate(
hom_doloso = .mean
) %>%
as_tsibble(
index = data
)
dados_reais_com_temp <- crime_estado_ts %>%
filter(
data %in% (cortes %>% yearmonth())
)
performance_com_temp <- accuracy(previsoes_fora_1_com_temp_ts$hom_doloso, dados_reais_com_temp$hom_doloso)
write_rds(performance_com_temp, "dados/crime_rio/performance_com_temp.rds")
The model with temperature proved better.
## ME RMSE MAE MPE MAPE
## Test set 0.8225141 37.14884 31.30003 -0.5059692 8.020367
## ME RMSE MAE MPE MAPE
## Test set 0.1528492 33.96707 27.68021 -0.5687739 7.140369
The fable library provides other models, some of them quite simple.
It is good practice to run the same test with these simple models, to check whether the more complex model is really adding value.
The seasonal naive model repeats the value from the corresponding previous period (respecting the seasonality) and adds a constant drift estimated from the series.
modelo_bobo <- function(dado){
dado %>%
model(snaive = SNAIVE(log(hom_doloso) ~ drift() + lag("year")))
}
plan(multiprocess)
previsoes_fora_1_bobo <- cortes %>%
future_map_dfr(
.f = ~previsao_fora_amostra(modelo = modelo_bobo, data_corte = .x, dado = crime_estado_ts),
.progress = TRUE
)
previsoes_fora_1_bobo_ts <- previsoes_fora_1_bobo %>%
mutate(
hom_doloso = .mean
) %>%
as_tsibble(
index = data
)
performance_bobo <- accuracy(previsoes_fora_1_bobo_ts$hom_doloso, dados_reais_com_temp$hom_doloso)
write_rds(performance_bobo, "dados/crime_rio/performance_bobo")
The result is worse.
## ME RMSE MAE MPE MAPE
## Test set 11.79164 71.99037 59.94568 1.559055 15.201
This model is estimated with a trend and seasonal terms.
modelo_decomp <- function(dado){
dado %>%
model(lm = TSLM(log(hom_doloso) ~ trend() + season()))
}
plan(multiprocess)
previsoes_fora_1_decomp <- cortes %>%
future_map_dfr(
.f = ~previsao_fora_amostra(modelo = modelo_decomp, data_corte = .x, dado = crime_estado_ts),
.progress = TRUE
)
previsoes_fora_1_decomp_ts <- previsoes_fora_1_decomp %>%
mutate(
hom_doloso = .mean
) %>%
as_tsibble(
index = data
)
performance_decomp <- accuracy(previsoes_fora_1_decomp_ts$hom_doloso, dados_reais_com_temp$hom_doloso)
write_rds(performance_decomp, "dados/crime_rio/performance_tslm")
The error was similar to that of the NAIVE model.
## ME RMSE MAE MPE MAPE
## Test set 50.96311 70.38793 60.22174 11.43432 14.2646